Abstract — We consider a wireless system with a small number 
of delay constrained users and a larger number of users without 
delay constraints. We develop a scheduling algorithm that reacts 
to time varying channels and maximizes throughput utility (to 
within a desired proximity), stabilizes all queues, and satisfies 
the delay constraints. The problem is solved by reducing the 
constrained optimization to a set of weighted stochastic shortest 
path problems, which act as natural generalizations of max- 
weight policies to Markov modulated networks. We also present 
approximation results that do not require a-priori statistical 
knowledge, and discuss the additional complexity and delay 
incurred as compared to systems without delay constraints. The 
solution technique is general and applies to other constrained 
stochastic network optimization problems. 



I. Introduction 

This paper considers delay-aware scheduling in a multi- 
user wireless uplink or downlink with K delay-constrained 
users and N delay-unconstrained users, each with different 
transmission channels. The system operates in slotted time 
with normalized slots $t \in \{0, 1, 2, \ldots\}$. Every slot, a random 
number of new packets arrive from each user. Packets are 
queued for eventual transmission, and every slot a scheduler 
looks at the queue backlog and the current channel states and 
chooses one channel to serve. The number of packets that are 
transmitted over that channel depends on its current channel 
state. The goal is to stabilize all queues, satisfy average delay 
constraints for the delay-constrained users, and drop as few 
packets as possible. 

Without the delay constraints, this problem is a classical 
opportunistic scheduling problem, and can be solved with 
efficient max-weight algorithms based on Lyapunov drift and 
Lyapunov optimization (see [1] and references therein). The 
delay constraints make the problem a much more complex 
Markov Decision Problem (MDP). While general methods for 
solving MDPs exist (see, for example, [2] [3] [4] [5]), they 
typically suffer from a curse of dimensionality. Specifically, 
the number of queue state vectors grows geometrically in 
the number of queues. Thus, a general problem with many 
queues has an intractably large state space. This creates 
non-polynomial implementation complexity for offline approaches 
such as linear programming [3] [4], and non-polynomial 
complexity and/or learning time for online or quasi online/offline 
approaches such as Q-learning [2] [6]. 

We do not solve this fundamental curse of dimensionality. 
Rather, we avoid this difficulty by focusing on the special 
structure that arises in a wireless network with a relatively 
small number of delay-constrained users (say, K < 5), 
but with an arbitrarily large number of users without delay 
constraints (so that N can be large). This is an important 
scenario, particularly in cases when the number of "best effort" 
users in a network is much larger than the number of delay-constrained 
users. We develop a solution that, on each slot, 
requires a computation whose complexity grows 
geometrically in K, but only polynomially in N. Further, the 
resulting convergence times and delays are fully polynomial in 
the total number of queues K + N. Our solution uses a concept 
of forced renewals that introduces a deviation from optimality 
that can be made arbitrarily small with a corresponding 
polynomial tradeoff in convergence time. Finally, we show 
that a simple Robbins-Monro approximation technique can be 
used, without knowledge of the channel or traffic statistics, and 
yields similar performance. Our methods are general and can 
be applied to other MDPs for queueing networks with similar 
structure. 

Related prior work on delay optimality for multi-user op- 
portunistic scheduling under special symmetric assumptions is 
developed in [7] [8] [9], and single-queue delay optimization 
problems are treated in [10] [11] [12] [13] using dynamic 
programming and Markov Decision theory. Optimal asymp- 
totic energy-delay tradeoffs are developed for single queue 
systems in [14], and optimal energy-delay and utility-delay 
tradeoffs for multi-queue systems are treated in [15] [16]. The 
algorithms of [15] [16] have very low complexity and converge 
quickly even for large networks, although the tradeoff-optimal 
delay guarantees they achieve do not necessarily optimize the 
coefficient multiplier in the delay expression. 

Our approach in the present paper treats the MDP problem 
associated with delay constraints using Lyapunov drift and 
Lyapunov optimization theory [1]. We extend the max-weight 
principles for stochastic network optimization to treat Markov- 
modulated networks, where the network costs depend on 
both the control actions taken and the current state (such as 
the queue state) the system is in. For each cost constraint 
we define a virtual queue, and show that the constrained 
MDP can be solved using Lyapunov drift theory implemented 
over a variable-length frame, where "max-weight" rules are 
replaced with weighted stochastic shortest path problems. This 
is similar to the Lagrange multiplier approaches used in the 
related works [12] [13] that treat power minimization for 
single-queue wireless links with an average delay constraint. 
The work in [12] uses stochastic approximation with a two-timescale 
argument and a limiting ordinary differential equation (ODE). 
The work in [13] treats a single-queue MIMO 
system using primal-dual updates [17]. Our virtual queues 
are similar to the Lagrange Multiplier updates in [12] [13]. 
However, we treat multi-queue systems, and we use a different 
analytical approach that emphasizes stochastic shortest paths 
over variable length frames. Our resulting algorithm has an 
implementation complexity that grows geometrically in the 
number of delay-constrained queues K, but polynomially in 
the number of delay-unconstrained queues N. Further, we 
obtain polynomial bounds on convergence times and delays. 

The next section describes the general stochastic network 
model and its application to delay-constrained wireless systems. 
Section III presents the weighted shortest-path algorithm. 
Section IV describes an approximate implementation 
that does not require a-priori knowledge of channel or traffic 
probabilities. The approximation scheme learns by observing 
past system outcomes, and uses a classic Robbins-Monro 
iteration (see [2]) together with a delayed queue analysis to 
decorrelate past samples from current queue states. Section V 
treats a more general problem of optimizing convex functions 
of time average penalties. 

II. Network Model 

Consider the following abstract model of a stochastic queueing 
network (application to delay-constrained wireless systems 
is detailed in Section II-C). The system operates in slotted time 
$t \in \{0, 1, 2, \ldots\}$. Let $\mathcal{N}$ represent a finite set of queues to be 
stabilized, and let $Q(t) = (Q_n(t))_{n \in \mathcal{N}}$ denote the vector of 
queue backlogs on slot $t$. Each queue $Q_n(t)$ is assumed to 
have infinite buffer space, and has dynamic update equation: 

$$ Q_n(t+1) = \max[Q_n(t) - \mu_n(t), 0] + R_n(t) \tag{1} $$

where $\mu_n(t)$ and $R_n(t)$ are the service rate and new arrivals, 
respectively, for queue $n$ on slot $t$. Let $\mu(t)$ and $R(t)$ be the 
vectors of service rates and arrival variables with entries $n \in \mathcal{N}$. 
On each slot $t$, the vectors $\mu(t)$ and $R(t)$ are determined as 
functions of a random outcome $\Omega(t)$, a state variable $z(t)$, 
and a control action $I(t)$: 

$$ \mu(t) \triangleq \hat{\mu}(I(t), \Omega(t), z(t)) $$
$$ R(t) \triangleq \hat{R}(I(t), \Omega(t), z(t)) $$
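
As a minimal sketch of the backlog recursion (1) for a single queue, the following Python snippet iterates the update over a short, made-up sequence of service and arrival values. The function names `queue_update` and `simulate`, and the sample numbers, are illustrative assumptions and are not taken from the paper.

```python
def queue_update(q, mu, r):
    """One-slot backlog update (1): Q(t+1) = max[Q(t) - mu(t), 0] + R(t)."""
    return max(q - mu, 0) + r


def simulate(q0, services, arrivals):
    """Return the backlog trajectory [Q(0), Q(1), ...] for the given samples."""
    q = q0
    history = [q]
    for mu, r in zip(services, arrivals):
        q = queue_update(q, mu, r)
        history.append(q)
    return history


# Hypothetical sample path: service offered and arrivals on four slots.
backlogs = simulate(0, services=[2, 0, 3, 1], arrivals=[1, 4, 0, 2])
print(backlogs)
```

Note that service offered to an empty queue is wasted: on the first slot above, two units of service are offered but the backlog is zero, so only the new arrival remains.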

Specifically, the control action $I(t)$ is made every slot with 
knowledge of $z(t)$ and $\Omega(t)$ (and also $Q(t)$), and is constrained 
to take values in an abstract set $\mathcal{I}_{\Omega(t),z(t)}$ that has arbitrary 
cardinality and that possibly depends on $\Omega(t)$ and $z(t)$. The 
random outcome $\Omega(t)$ takes values in a set with arbitrary 
cardinality, and represents a collection of network parameters 
(such as channel states) that can randomly change from slot to 
slot. We assume that $\Omega(t)$ is i.i.d. over slots with some fixed 
(but potentially unknown) distribution that does not depend 
on the current state or the past network control actions. The 
state variable $z(t)$ takes values in a finite or countably infinite 
set $\mathcal{Z}$, and represents a controlled Markov chain related to 
the network (this will be used to represent delay-constrained 
queues in Section II-C). The transition probabilities of 
$z(t)$ depend on $\Omega(t)$ and on the control decision $I(t)$. That is, 
for all states $y, z \in \mathcal{Z}$, we define $P_{yz}(I, \Omega)$ as follows: 

$$ P_{yz}(I, \Omega) \triangleq Pr[z(t+1) = z \mid z(t) = y, I(t) = I, \Omega(t) = \Omega] $$

The state space $\mathcal{Z}$ is assumed to contain a state $0$ that is 
accessible from any state $z \in \mathcal{Z}$. Below, we 
impose an additional $\phi$-forced renewal assumption, where the 
probability of reaching state $0$ from any state $z \in \mathcal{Z}$ and under 
any $\Omega(t)$, $I(t)$ is at least $\phi$, for some positive probability $\phi > 0$ 
(described in more detail in Section II-B). 

For each slot $t$ we have a collection of general network 
penalties $x_m(t)$ for $m \in \{0, 1, \ldots, M\}$ for some finite integer 
$M$. These are defined by penalty functions $\hat{x}_m(\cdot)$ that represent 
different types of costs incurred when a control action $I(t)$ is 
taken under outcome $\Omega(t)$ and state variable $z(t)$: 

$$ x_m(t) \triangleq \hat{x}_m(I(t), \Omega(t), z(t)) $$

The penalty functions are arbitrary and possibly negative 
(negative penalties can represent rewards). However, they are 
assumed to be upper and lower bounded by finite constants 
$x_m^{min}$ and $x_m^{max}$, so that regardless of $I(t)$, $\Omega(t)$, $z(t)$ we have: 

$$ x_m^{min} \le \hat{x}_m(\cdot) \le x_m^{max} $$

Similarly, the $\hat{\mu}(\cdot)$ and $\hat{R}(\cdot)$ functions are arbitrary but are 
assumed to be bounded by constants $\mu_n^{max}$ and $R_n^{max}$: 

$$ 0 \le \hat{\mu}_n(\cdot) \le \mu_n^{max}, \quad 0 \le \hat{R}_n(\cdot) \le R_n^{max} \quad \forall n \in \mathcal{N} $$

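For each penalty $m \in \{0, 1, \ldots, M\}$, each queue $Q_n(t)$ for 
$n \in \mathcal{N}$, and for a given control policy that makes decisions 
$I(t)$ over time, we define the following time averages: 

$$ \bar{x}_m \triangleq \limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E\{\hat{x}_m(I(\tau), \Omega(\tau), z(\tau))\} $$

$$ \bar{Q}_n \triangleq \limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E\{Q_n(\tau)\} $$

We say that a queue $Q_n(t)$ is stable if $\bar{Q}_n < \infty$.¹ We now 
state the stochastic optimization problem of interest. 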

Stochastic Optimization Problem: Determine a control policy 
that solves: 

$$ \text{Minimize:} \quad \bar{x}_0 \tag{2} $$
$$ \text{Subject to:} \quad \bar{x}_m \le x_m^{av} \quad \text{for all } m \in \{1, \ldots, M\} \tag{3} $$
$$ \bar{Q}_n < \infty \quad \text{for all } n \in \mathcal{N} \tag{4} $$

where each constant $x_m^{av}$ represents a desired constraint on the 
time average of penalty $x_m(t)$. 

¹ This is often called strongly stable as it implies finite average queue 
backlog. 

The stochastic optimization problem seeks to minimize the 
time average of penalty $x_0(t)$ subject to constraints on the time 
averages of all other penalties, and on the stability of all queues 
$Q_n(t)$ for $n \in \mathcal{N}$. This is similar to the stochastic network 
optimization problems of [1] [18] [19], with the exception that 
the penalty function now includes the state variable $z(t) \in \mathcal{Z}$, 
and the transition probabilities for $z(t)$ depend on the control 
action $I(t)$ and the random i.i.d. variable $\Omega(t)$. We say that the 
problem is a stochastic feasibility problem if we desire only 
to satisfy the time average constraints (3)-(4), without regard 
for minimization of $\bar{x}_0$. The problem (2)-(4) is generalized 
in Section V to treat optimization of convex functions of 
time averages, similar to the objectives considered without the 
Markov modulated state variable $z(t)$ in [1] [20] [21]. 

A. Existence of a Maximizer for Linear Functionals 

We have assumed that the control action $I(t)$ takes values in 
an abstract set $\mathcal{I}_{\Omega(t),z(t)}$ with arbitrary cardinality, and that the 
probability transition matrix $P_{yz}(I, \Omega)$ is an arbitrary function 
of $I$, and the corresponding functions $\hat{\mu}_n(I(t), \Omega(t), z(t))$, 
$\hat{R}_n(I(t), \Omega(t), z(t))$, $\hat{x}_m(I(t), \Omega(t), z(t))$ are arbitrary but 
bounded. It turns out that our resulting control algorithm will 
involve choosing $I(t)$ to maximize a weighted sum of these 
functions every slot. Hence, it is useful to assume throughout 
the paper that for any given $\Omega$, $z$, and for any (possibly 
negative) scalars $\{\alpha_n, \beta_n, \gamma_m, \delta_y\}$, problems of the type: 

$$ \text{Maximize:} \quad \sum_n \alpha_n \hat{\mu}_n(I, \Omega, z) + \sum_n \beta_n \hat{R}_n(I, \Omega, z) + \sum_m \gamma_m \hat{x}_m(I, \Omega, z) + \sum_y \delta_y P_{zy}(I, \Omega) $$
$$ \text{Subject to:} \quad I \in \mathcal{I}_{\Omega,z} $$

have at least one well defined maximizer $I^* \in \mathcal{I}_{\Omega,z}$. This holds 
in most practical situations, such as whenever the set $\mathcal{I}_{\Omega,z}$ is 
finite for all $\Omega$ and $z$. Alternatively, it also holds whenever 
$\mathcal{I}_{\Omega,z}$ is a compact subset of $\mathbb{R}^L$ (for some finite integer $L$), 
the state space $\mathcal{Z}$ is finite (so that the sum $\sum_y P_{zy}(\cdot)$ has a 
finite number of terms), and the functions $\hat{\mu}_n(\cdot)$, $\hat{R}_n(\cdot)$, $\hat{x}_m(\cdot)$, 
$P_{zy}(\cdot)$ are continuous in $I$ for all $\Omega$ and $z$. 

B. The Forced Renewal Assumption 

To ensure that the $z(t)$ state variable "renews" itself regularly 
by returning to the $0$ state, we consider the following 
simple (and sub-optimal) mechanism. Let $\Omega(t) = [\omega(t); \phi(t)]$, 
where $\omega(t)$ is the random outcome of network state variables 
(taking values in an abstract set $\mathcal{W}$ with arbitrary cardinality), 
and $\phi(t)$ is an independent Bernoulli 0/1 variable that is 
i.i.d. over slots with $Pr[\phi(t) = 1] = \phi$, for some small 
but positive forced renewal probability $\phi > 0$. If $\phi(t) = 1$, 
the system experiences a forced renewal event which ensures 
that $z(t+1) = 0$. Thus, the transition probabilities have the 
following property for all $\omega \in \mathcal{W}$, $y \in \mathcal{Z}$, and $I \in \mathcal{I}_{[\omega;1],y}$: 

$$ P_{yz}(I, \Omega) = 0 \quad \text{if } \Omega = (\omega, 1) \text{ and } z \ne 0 $$
$$ P_{y0}(I, \Omega) = 1 \quad \text{if } \Omega = (\omega, 1) $$

The value of $\phi(t)$ is known to the network controller at the 
beginning of slot $t$, although if $\phi(t) = 1$ the renewal itself only 
occurs at the end of slot $t$ when the next state $z(t+1)$ is forced 
to $0$. In this way, the control action taken during a slot $t$ in 
which $\phi(t) = 1$ still affects the queueing and penalty functions 
$\hat{\mu}(I(t), \Omega(t), z(t))$, $\hat{R}(I(t), \Omega(t), z(t))$, $\hat{x}_m(I(t), \Omega(t), z(t))$, 
and these functions possibly have different values when $\phi(t) = 0$ 
versus $\phi(t) = 1$ (recall that $\Omega(t) = [\omega(t); \phi(t)]$, so that the 
functions can also depend on $\phi(t)$). 

This forced renewal structure implicitly assumes that the 
system can physically reset the state variable $z(t)$ to zero on 
any slot. Further, even if the system has this physical capability, 
it is generally sub-optimal to force such renewals with 
probability $\phi$ every slot. However, for many systems of interest 
(such as the network defined in Section II-C), if $\phi$ is 
small then the optimal performance over systems constrained 
by this $\phi$-forced renewal assumption is close to the optimal 
performance for systems without forced renewals. Throughout 
this paper, we define optimality in terms of systems with 
enforced renewals, with the understanding that $\phi$ is a small but 
positive value. 

C. Wireless Systems with Delay Constraints 

Consider now the following wireless system that operates 
in discrete time and fits the abstract model defined above. Let 
$\mathcal{N} = \{1, \ldots, N\}$ denote a set of delay-unconstrained queues, 
and let $\mathcal{K} = \{1, \ldots, K\}$ represent a set of delay-constrained 
queues. All packets have fixed lengths, and we let $Q(t) = (Q_n(t))_{n \in \mathcal{N}}$ 
and $Z(t) = (Z_k(t))_{k \in \mathcal{K}}$ be the vectors of integer 
queue lengths in all delay-unconstrained and delay-constrained 
queues, respectively, on slot $t$. Suppose each delay-constrained 
queue has a finite buffer size $B_{max}$, and let $\mathcal{Z}$ represent the 
state space for $Z(t)$, which has a finite size of $(B_{max} + 1)^K$. To 
emphasize membership in the state space $\mathcal{Z}$ (and to simplify 
notation for the transition probabilities), we let $z(t)$ represent 
the vector state $Z(t)$. 

Forced renewals occur according to the i.i.d. Bernoulli 
process $\phi(t)$ with forced renewal probability $\phi > 0$. If a forced 
renewal occurs on slot $t$ (so that $\phi(t) = 1$), all data in all 
delay-constrained queues $k \in \mathcal{K}$ is dropped at the end of the 
slot, so that $z(t+1) = 0$ (where the state $0 \in \mathcal{Z}$ represents 
the vector of all zeros). The data in the delay-unconstrained 
queues is not dropped. The drop rate in a queue 
$k \in \mathcal{K}$ due to such forced renewals is at most $(B_{max} + \lambda_k)\phi$ 
drops/slot, where $\lambda_k$ is the rate of new arrivals to queue $k$. 
The value $(B_{max} + \lambda_k)\phi$ can be made arbitrarily small with 
a small choice of $\phi$. 

Let $A_i(t)$ represent the number of new packet arrivals for 
user $i \in \mathcal{N} \cup \mathcal{K}$ on slot $t$. Let $S_i(t)$ be the current channel 
state for user $i \in \mathcal{N} \cup \mathcal{K}$. Specifically, $S_i(t)$ is a non-negative 
integer that represents the number of packets that can be 
transmitted over channel $i$ on slot $t$ if the channel is selected 
for transmission. Let $A(t)$ and $S(t)$ be vectors of $A_i(t)$ and 
$S_i(t)$ components. Assume that the joint vector $[A(t), S(t)]$ 
is i.i.d. over slots (possibly with correlated entries). Arrivals 
and channels are assumed to be bounded by constants $A_{max}$, 
$S_{max}$, so that $A_i(t) \le A_{max}$ and $S_i(t) \le S_{max}$ for all $i$ and $t$. 
Every slot the controller observes all channel states and must 
select a single channel (either from the delay-constrained or 
delay-unconstrained queues) to serve. 

For each finite buffer queue $k \in \mathcal{K}$, the controller makes 
an additional admit/drop decision immediately upon packet 
arrival (subject to the finite buffer constraint). Let $R_k(t)$ and 
$D_k(t)$ respectively represent the amount of new arrivals added 
on slot $t$ and new packets dropped on slot $t$, where: 

$$ R_k(t) + D_k(t) = A_k(t) \tag{5} $$

The queue update equation is given by: 

$$ Z_k(t+1) = \begin{cases} \max[Z_k(t) - \mu_k(t), 0] + R_k(t) & \text{if } \phi(t) = 0 \\ 0 & \text{if } \phi(t) = 1 \end{cases} \tag{6} $$

where for each $i \in \mathcal{N} \cup \mathcal{K}$ we have $\mu_i(t) = S_i(t)$ if channel 
$i$ is served on slot $t$, and $\mu_i(t) = 0$ else. 

Let $\omega(t) = [A(t), S(t)]$ be a combined system state variable 
that captures the random arrivals and channels, and let 
$\Omega(t) = [\omega(t); \phi(t)]$. Let $I(t)$ be a combined control action, 
which indicates which channel $i \in \mathcal{K} \cup \mathcal{N}$ to serve, and how to 
choose the admit/drop variables $R_k(t)$ and $D_k(t)$ for $k \in \mathcal{K}$. 
The control action $I(t)$ is made with knowledge of $\Omega(t)$ and 
$z(t)$ (and also $Q(t)$). The constraint set $\mathcal{I}_{\Omega(t),z(t)}$ is defined 
to ensure that at most one channel $i \in \mathcal{K} \cup \mathcal{N}$ is served, and 
that packet drops act according to (5)-(6) and satisfy the finite 
buffer constraint $Z_k(t+1) \le B_{max}$. Given the $\Omega(t)$, $z(t)$, and 
$I(t)$ values, the next state is deterministically known, so that 
the transition probabilities $P_{zy}(I, \Omega)$ are either 0 or 1. Finally, 
the queueing dynamics for each delay-unconstrained queue 
$n \in \mathcal{N}$ are given by (1), with arrival and service functions 
given by: 

$$ \hat{R}_n(I(t), \Omega(t), z(t)) = A_n(t) $$
$$ \hat{\mu}_n(I(t), \Omega(t), z(t)) = \begin{cases} S_n(t) & \text{if } n \text{ is served on slot } t \\ 0 & \text{otherwise} \end{cases} $$

Note in this case that the input to each delay-unconstrained 
queue $n \in \mathcal{N}$ is the (uncontrolled) random process $A_n(t)$. 

D. Example Penalties for Average Congestion and Delay 

To use the framework of abstract penalty functions to 
enforce an average congestion bound on queue $Z_k(t)$ (for a 
given $k \in \mathcal{K}$), we can define a penalty function of the form: 

$$ \hat{x}_k(I(t), \Omega(t), z(t)) = Z_k(t) $$

This penalty function does not use the $I(t)$ or $\Omega(t)$ arguments, 
and uses the fact that $Z_k(t)$ is a component of the $z(t)$ state 
variable. Enforcing a constraint of the type $\bar{x}_k \le x_k^{av}$ ensures 
that average queue congestion is no more than $x_k^{av}$. 

To enforce a constraint on the time average rate of dropping 
packets in a delay-constrained queue $k \in \mathcal{K}$, we can define a 
penalty function of the form: 

$$ \hat{x}_k(I(t), \Omega(t), z(t)) = \begin{cases} A_k(t) - R_k(t) & \text{if } \phi(t) = 0 \\ A_k(t) + Z_k(t) - \tilde{\mu}_k(t) & \text{if } \phi(t) = 1 \end{cases} $$

where $\tilde{\mu}_k(t) = \min[\mu_k(t), Z_k(t)]$ represents the number of 
packets served in queue $k$ on slot $t$. In this case, the penalty 
is equal to the exact amount of packet drops in queue $k$ on 
slot $t$, so that ensuring $\bar{x}_k \le x_k^{av}$ enforces a constraint on the 
time average rate of packet drops. Defining a penalty function 
$\hat{x}_0(\cdot)$ as a (possibly weighted) sum of packet drops in all of 
the delay-constrained queues $k \in \mathcal{K}$ allows for minimization 
of a weighted sum of packet drop rates subject to additional 
desired constraints. 
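
The one-slot dynamics of Section II-C and the packet-drop penalty above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `wireless_step`, the dictionary-based representation, the user labels, and the sample numbers are all hypothetical, and the admit decisions are passed in rather than computed by any scheduling policy.

```python
def wireless_step(Q, Z, A, S, serve, admit, forced, b_max):
    """Advance all queues by one slot and return (Q_next, Z_next, drops).

    Q, Z   : dicts of backlogs for delay-unconstrained / delay-constrained users
    A, S   : dicts of arrivals A_i(t) and channel states S_i(t), keyed by user
    serve  : the single user whose channel is served on this slot
    admit  : dict giving R_k(t) for each delay-constrained user k
    forced : True when the forced-renewal indicator phi(t) equals 1
    b_max  : finite buffer size of every delay-constrained queue
    """
    # mu_i(t) = S_i(t) for the served channel and 0 for all others.
    mu = {i: (S[i] if i == serve else 0) for i in S}

    # Delay-unconstrained queues follow update (1) with R_n(t) = A_n(t).
    Q_next = {n: max(Q[n] - mu[n], 0) + A[n] for n in Q}

    Z_next, drops = {}, {}
    for k in Z:
        served = min(mu[k], Z[k])  # packets actually served in queue k
        if forced:
            # Forced renewal: the queue is emptied at the end of the slot,
            # so every arrival and every unserved packet counts as a drop.
            Z_next[k] = 0
            drops[k] = A[k] + Z[k] - served
        else:
            r = admit[k]
            if not 0 <= r <= A[k]:
                raise ValueError("admissions must satisfy 0 <= R_k <= A_k")
            z_new = max(Z[k] - mu[k], 0) + r  # update (6) with phi(t) = 0
            if z_new > b_max:
                raise ValueError("admission violates the finite buffer")
            Z_next[k] = z_new
            drops[k] = A[k] - r  # D_k(t) from (5)
    return Q_next, Z_next, drops


# Hypothetical slot: two best-effort users and one delay-constrained user.
Q = {"n1": 3, "n2": 0}
Z = {"k1": 2}
A = {"n1": 1, "n2": 2, "k1": 3}
S = {"n1": 2, "n2": 1, "k1": 1}
normal = wireless_step(Q, Z, A, S, serve="k1", admit={"k1": 2}, forced=False, b_max=4)
renewal = wireless_step(Q, Z, A, S, serve="k1", admit={"k1": 2}, forced=True, b_max=4)
```

The returned `drops` value is exactly the per-slot drop penalty in both branches, so averaging it over time gives the drop rate that the constraint bounds.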

Finally, to enforce a constraint that the average delay of 
(non-dropped) packets in a queue $k \in \mathcal{K}$ is less than or equal 
to some desired bound $W_k^{av}$ (where $W_k^{av}$ is a given constant), 
we can use a penalty function of the form: 

$$ \hat{x}_k(I(t), \Omega(t), z(t)) = Z_k(t) - \tilde{\mu}_k(t) W_k^{av} $$

and enforce the constraint $\bar{x}_k \le 0$. Assuming time average 
limits are well defined, this ensures that 

$$ \bar{Z}_k - \tilde{\lambda}_k W_k^{av} \le 0 \tag{7} $$

where $\tilde{\lambda}_k$ is the time average rate of actual packets served 
in queue $k$ (and is also the time average rate of non-dropped 
packets that are admitted). By Little's Theorem [22], we have: 

$$ \bar{Z}_k = \tilde{\lambda}_k \bar{W}_k $$

where $\bar{W}_k$ is the average delay of non-dropped packets in 
queue $k$, and so from (7) we deduce that $\bar{W}_k \le W_k^{av}$ 
(assuming that $\tilde{\lambda}_k > 0$). 
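
The link between the delay penalty and Little's Theorem can be checked numerically on a short trace. The sketch below is an illustration only: the function names and the sample trace are hypothetical, and applying Little's Theorem to a four-slot trace is a simplification, since the theorem concerns long-run averages.

```python
def delay_penalty(z, served, w_av):
    """Per-slot delay penalty: Z_k(t) - (packets served on slot t) * W_k^av."""
    return z - served * w_av


def average_delay_check(z_trace, served_trace, w_av):
    """Return (x_bar, w_bar): average penalty and the Little's-law delay."""
    t = len(z_trace)
    z_bar = sum(z_trace) / t              # average backlog
    lam = sum(served_trace) / t           # average rate of served packets
    x_bar = sum(delay_penalty(z, s, w_av)
                for z, s in zip(z_trace, served_trace)) / t
    w_bar = z_bar / lam                   # Little's Theorem, needs lam > 0
    return x_bar, w_bar


# Hypothetical backlog and service samples with a delay target of 2.5 slots.
x_bar, w_bar = average_delay_check([2, 3, 1, 2], [1, 1, 0, 2], w_av=2.5)
print(x_bar, w_bar)
```

Because the average penalty equals the average backlog minus the service rate times the target, a non-positive `x_bar` is equivalent to `w_bar` being at most the target whenever the service rate is positive.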

E. Slackness Assumptions 

Consider the general stochastic queueing network model, 
and let $\mathcal{M} = \{1, \ldots, M\}$ represent the set of penalties involved 
in the feasibility constraints (3)-(4). 

Definition 1: A control policy $I(t)$ is a $(z, \Omega)$-only policy 
if it satisfies the $\phi$-forced renewal assumption and it makes 
stationary and possibly randomized control actions 
$I(t) \in \mathcal{I}_{\Omega(t),z(t)}$ for each slot $t$ based only on the current $\Omega(t)$ and 
$z(t)$ (and hence independently of $Q(t)$). 

Suppose there exists a $(z, \Omega)$-only policy $I^*(t)$ that satisfies 
the feasibility constraints (3)-(4). Let $z^*(t)$ represent the resulting 
network state variable under this policy, and note it evolves 
according to an irreducible finite or countably infinite state 
Markov chain. Hence, time average limits are well defined 
[22]. Let $\bar{x}_m^*$, $\bar{\mu}_n^*$, $\bar{r}_n^*$ respectively represent the time average 
of penalty $x_m(t)$, transmission $\mu_n(t)$, and admission $R_n(t)$, 
under policy $I^*(t)$. Because queue stability requires the time 
average arrival rate to be less than or equal to the time average 
service rate, it is easy to show that (3)-(4) imply: 



$$ \bar{x}_m^* \le x_m^{av} \quad \text{for all } m \in \mathcal{M} \tag{8} $$
$$ \bar{\mu}_n^* - \bar{r}_n^* \ge 0 \quad \text{for all } n \in \mathcal{N} \tag{9} $$

Let $x_0^{opt}$ represent the infimum value of $\bar{x}_0$ over all $(z, \Omega)$-only 
policies that satisfy (8)-(9). We shall measure optimality 
of our algorithm designs with respect to $x_0^{opt}$. This is typically 
non-restrictive. For example, if $\mathcal{Z}$ is a finite state space, it 
can be shown that the infimum of $\bar{x}_0$ over all policies that 
satisfy (3)-(4) is equal to $x_0^{opt}$.² 

² This can be shown by well known optimality of stationary randomized 
policies for MDP problems over finite state spaces [4] and for queue stability 
problems [18], although the formal proof is omitted for brevity. 

Assume that $z(0) = 0$, and define renewal events as times 
$\{t_g\}_{g=0}^{\infty}$, starting with $t_0 = 0$, where each renewal event $t_g$ 
occurs when $z(t_g) = 0$ and some other criterion is met (as 
described below). Define a renewal interval as the duration of 
time between successive renewal events (including the starting 
renewal slot but not including the ending renewal slot). Define 
$T_g$ as the size of the $g$th renewal interval (also called the 
inter-renewal time). The additional criterion that defines a 
renewal can be anything that satisfies the following renewal 
requirements: 

• There are finite constants $m_1$ and $m_2$ such that under 
any policy for choosing $I(t)$ over time, the inter-renewal 
times have first and second moments upper bounded by 
$m_1$ and $m_2$, respectively, regardless of past history. 

• Under any $(z, \Omega)$-only policy $I^*(t)$, the inter-renewal 
times are independent and identically distributed (i.i.d.), 
as are the sequences of decisions and penalties incurred 
over different renewal intervals. 

Note that these requirements are met whenever the system 
"resets" itself at renewal events, so that under a particular 
$(z, \Omega)$-only policy the system has independent but identically 
distributed behavior on each renewal interval. By basic renewal 
theory, all time average penalties have well defined limits that 
are exactly equal to the expected sum penalty over a renewal 
interval divided by the expected duration of the renewal 
interval [23]. 

For our purposes, we focus on the following three different 
examples of renewal definitions. Systems with "type-1 
renewals" have renewals defined by any slot $t_g$ at which 
$z(t_g) = 0$. Note that a type-1 renewal may arise either because 
of a forced renewal, or because of controller decisions that 
lead to the $z(t) = 0$ state. Hence, the average duration 
of a type-1 renewal interval is less than or equal to $1/\phi$. 
Alternatively, "type-2 renewals" are defined only by forced 
renewal events, and hence have average size exactly equal 
to $1/\phi$. Finally, "type-3 renewals" are defined by every $b$th 
visitation to the $z(t) = 0$ state, where $b$ is a given positive 
integer. Thus, the average duration of a type-3 renewal is less 
than or equal to $b/\phi$. Note that all three definitions meet the 
renewal requirements specified above. 
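
For type-2 renewals the interval length is geometric with success probability $\phi$, so its moments can be checked by direct summation. The sketch below is an illustration under that assumption; the function name `renewal_moments`, the truncation horizon, and the sample values of `phi` and `b` are hypothetical choices. It compares truncated sums against the mean $1/\phi$ and the standard geometric second moment $(2-\phi)/\phi^2$.

```python
def renewal_moments(phi, horizon=10_000):
    """Truncated-sum E[T] and E[T^2] when Pr[T = t] = phi * (1 - phi)**(t - 1)."""
    mean = 0.0
    second = 0.0
    p = phi  # Pr[T = 1]
    for t in range(1, horizon + 1):
        mean += t * p
        second += t * t * p
        p *= 1.0 - phi  # Pr[T = t + 1]
    return mean, second


phi = 0.05
mean_t, second_t = renewal_moments(phi)
print(mean_t, 1.0 / phi)                   # type-2 mean interval length
print(second_t, (2.0 - phi) / phi ** 2)    # geometric second moment

b = 3
type3_mean_bound = b * mean_t              # type-3 intervals average at most b / phi
```

A smaller `phi` keeps the system closer to one without forced renewals, but the mean interval grows as $1/\phi$ and the second moment as $1/\phi^2$.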

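Consider any valid renewal definition for the network (such 
as a type-1, type-2, or type-3 definition). Suppose that $z(0) = 0$ 
and that time $0$ is a renewal time. Define $T^*$ as the random 
time until the next renewal event under the $(z, \Omega)$-only policy 
$I^*(t)$ that satisfies (8)-(9). We have by basic renewal theory: 

$$ \bar{x}_m^* = \frac{E\left\{\sum_{\tau=0}^{T^*-1} \hat{x}_m(I^*(\tau), \Omega(\tau), z^*(\tau))\right\}}{E\{T^*\}} \quad \forall m \in \mathcal{M} $$

$$ \bar{\mu}_n^* - \bar{r}_n^* = \frac{E\left\{\sum_{\tau=0}^{T^*-1} \hat{d}_n(I^*(\tau), \Omega(\tau), z^*(\tau))\right\}}{E\{T^*\}} \quad \forall n \in \mathcal{N} $$

where $\hat{d}_n(I, \Omega, z)$ is defined: 

$$ \hat{d}_n(I, \Omega, z) \triangleq \hat{\mu}_n(I, \Omega, z) - \hat{R}_n(I, \Omega, z) $$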

In addition to assuming the feasibility constraints (3)-(4) are 
satisfied, we make the following two mild assumptions. The 
first is a slackness assumption that is a stochastic analogue of 
a Slater condition for static optimization problems [17]. 

Assumption 1: (Slackness of Feasibility) There exists a value 
$\epsilon > 0$ such that the constraints of (8)-(9) can be met with $\epsilon$ 
slackness. Specifically, there exists a $(z, \Omega)$-only policy $I^*(t)$ 
that satisfies the following for all $m \in \mathcal{M}$ and $n \in \mathcal{N}$: 

$$ \frac{E\left\{\sum_{\tau=0}^{T^*-1} x_m^*(\tau)\right\}}{E\{T^*\}} \le x_m^{av} - \epsilon \tag{10} $$

$$ \frac{E\left\{\sum_{\tau=0}^{T^*-1} d_n^*(\tau)\right\}}{E\{T^*\}} \ge \epsilon \tag{11} $$

where for notational simplicity we have defined: 

$$ x_m^*(\tau) \triangleq \hat{x}_m(I^*(\tau), \Omega(\tau), z^*(\tau)) $$
$$ d_n^*(\tau) \triangleq \hat{d}_n(I^*(\tau), \Omega(\tau), z^*(\tau)) $$

Assumption 2: (Optimization) There exists a $(z, \Omega)$-only 
policy $I^*(t)$ (not necessarily the same policy as in Assumption 
1) that satisfies: 

$$ \frac{E\left\{\sum_{\tau=0}^{T^*-1} x_0^*(\tau)\right\}}{E\{T^*\}} = x_0^{opt} \tag{12} $$

$$ \frac{E\left\{\sum_{\tau=0}^{T^*-1} x_m^*(\tau)\right\}}{E\{T^*\}} \le x_m^{av} \quad \forall m \in \mathcal{M} \tag{13} $$

$$ \frac{E\left\{\sum_{\tau=0}^{T^*-1} d_n^*(\tau)\right\}}{E\{T^*\}} \ge 0 \quad \forall n \in \mathcal{N} \tag{14} $$

where $T^*$ is the size of the first renewal interval, and $z^*(\tau)$ 
is the network state at time $\tau$ under policy $I^*(t)$. 

Assumption 1 states that there is a $(z, \Omega)$-only policy 
that satisfies all feasibility constraints (and queue stability 
constraints) with $\epsilon$ slackness. Assumption 2 states that there 
is another, typically different, $(z, \Omega)$-only policy that achieves 
the desired optimum value $x_0^{opt}$ while satisfying the feasibility 
constraints (possibly with no slackness in these constraints).³ 

III. The Dynamic Control Algorithm 

To solve the stochastic feasibility and stochastic optimization 
problems for our queueing network, we extend the framework 
of [1] to a case of variable length frames. Specifically, 
for each time average penalty constraint (3), parameterized by 
$m \in \mathcal{M} = \{1, \ldots, M\}$, we define a virtual queue $Y_m(t)$ that 
is initialized to zero and has dynamic update equation: 

$$ Y_m(t+1) = \max[Y_m(t) - x_m^{av} + x_m(t), 0] \tag{15} $$

where $x_m(t) = \hat{x}_m(I(t), \Omega(t), z(t))$ is the penalty incurred 
on slot $t$ by a particular choice of the control decision $I(t)$ 
(under the observed $\Omega(t)$ and $z(t)$). The intuition is that if the 
virtual queue $Y_m(t)$ is stable, then the time average rate of 
the "input process" $x_m(t)$ is less than or equal to the "service 
rate" $x_m^{av}$ [18]. This turns the time average constraint into a 
simple queue stability problem.⁴ 

³ In the case when the infimum value $x_0^{opt}$ is only achievable over a limit 
of an infinite sequence of $(z, \Omega)$-only policies, we can replace $x_0^{opt}$ with 
$x_0^{achieve}$, where $x_0^{achieve}$ is any achievable value, and then recover the 
results of Theorem 2 by taking a limit. 

⁴ Note that $Y_m(t)$ can be viewed as a "generalized" queue, as the "service 
rate" $x_m^{av}$ can be negative, as can the $x_m(t)$ value. 
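
The intuition behind the virtual queue (15) can be illustrated with a short Python sketch. Dropping the max in (15) gives $Y_m(t+1) \ge Y_m(t) - x_m^{av} + x_m(t)$, and summing over $t$ slots from $Y_m(0) = 0$ shows that the empirical average penalty is at most $x_m^{av} + Y_m(t)/t$. The function names and the penalty samples below are hypothetical and are chosen only to exercise the update.

```python
def virtual_queue_update(y, x, x_av):
    """Virtual queue update (15): Y(t+1) = max[Y(t) - x_av + x(t), 0]."""
    return max(y - x_av + x, 0.0)


def run_virtual_queue(penalties, x_av):
    """Iterate (15) from Y(0) = 0 and return the final virtual backlog."""
    y = 0.0
    for x in penalties:
        y = virtual_queue_update(y, x, x_av)
    return y


# Hypothetical penalty samples and target time-average bound.
penalties = [3.0, 0.0, 2.0, 1.0, 0.0, 4.0]
x_av = 1.5
y_final = run_virtual_queue(penalties, x_av)

average_penalty = sum(penalties) / len(penalties)
bound = x_av + y_final / len(penalties)  # sample-path bound implied by (15)
print(y_final, average_penalty, bound)
```

If the virtual backlog grows sublinearly in time, the correction term `y_final / len(penalties)` vanishes and the time average constraint is met.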






A. Lyapunov Drift 

Define $Y(t)$ as a vector of all virtual queues $Y_m(t)$ for 
$m \in \mathcal{M}$, and define $\Theta(t) = [Y(t); Q(t)]$ as the combined 
queue vector. Assume all queues are initially empty, so that 
$\Theta(0) = 0$. Define the following quadratic Lyapunov function: 

$$ L(\Theta(t)) \triangleq \frac{1}{2} \sum_{n \in \mathcal{N}} Q_n(t)^2 + \frac{1}{2} \sum_{m \in \mathcal{M}} Y_m(t)^2 $$

Suppose time $t_g$ is a renewal event (for any valid renewal 
definition), and let $T$ be the random time until the next renewal 
event (which may depend on the control policy, such as when 
type-1 renewals are used). Define the variable-slot conditional 
Lyapunov drift $\Delta_T(\Theta(t_g))$ as follows:⁵ 

$$ \Delta_T(\Theta(t_g)) \triangleq E\{L(\Theta(t_g + T)) - L(\Theta(t_g)) \mid \Theta(t_g), z(t_g) = 0\} \tag{16} $$

The expectation in the drift definition above is with respect 
to the random renewal interval duration $T$, the random events 
that can take place over this interval, and the possibly random 
control actions $I(t)$ that are made during this interval. The 
explicit conditioning on $z(t_g) = 0$ in (16) will be suppressed 
in the remainder of this paper, as this conditioning is implied 
given that $t_g$ is a renewal time. 

It is important to note the following subtlety: The renewal 
events under a given policy I(t) may arise from the decisions 
made under the policy (as in type-1 or type-3 definitions), 
although the implemented policy I(t) may not be stationary 
and/or may depend on the queue values Q(t), and so actual 
system events are not necessarily i.i.d. over different renewals. 
Therefore, these "renewal-events" do not necessarily reset the 
system dynamics of the actual system. However, these times 
act as convenient "time-stamps" over which to analytically 
compare the Lyapunov drift of the actual implemented policy 
with the corresponding drifts of the $(z, \Omega)$-only policies of 
Assumptions 1 and 2. 

Lemma 1: (Lyapunov Drift) Under any network control 
policy for choosing $I(t)$ over time, and for any renewal 
definition that meets the renewal requirements, the variable-slot 
conditional Lyapunov drift satisfies the following at any 
renewal time $t_g$ and any $\Theta(t_g)$: 

$$ \Delta_T(\Theta(t_g)) \le B + D(\Theta(t_g)) \tag{17} $$

where $B$ is a finite constant and $D(\Theta(t_g))$ is defined: 

$$ D(\Theta(t_g)) \triangleq -\sum_{n \in \mathcal{N}} Q_n(t_g) E\left\{\sum_{\tau=0}^{T-1} d_n(t_g + \tau) \mid \Theta(t_g)\right\} - \sum_{m \in \mathcal{M}} Y_m(t_g) E\left\{T x_m^{av} - \sum_{\tau=0}^{T-1} x_m(t_g + \tau) \mid \Theta(t_g)\right\} \tag{18} $$

where we recall that: 

$$ d_n(t) \triangleq \hat{d}_n(I(t), \Omega(t), z(t)), \quad x_m(t) \triangleq \hat{x}_m(I(t), \Omega(t), z(t)) $$

and where $\sigma^2$ is a constant, used to bound $B$, that satisfies the following for all $t$: 

$$ \sigma^2 \ge \sum_{n \in \mathcal{N}} [\mu_n(t)^2 + R_n(t)^2] + \sum_{m \in \mathcal{M}} (x_m(t) - x_m^{av})^2 \tag{19} $$

Note that $\sigma^2$ is finite due to the finite bounds on the penalties 
$x_m(t)$ and on the queue variables $\mu_n(t)$ and $R_n(t)$, and $B$ is 
finite and bounded by $m_2 \sigma^2 / 2$ by the second moment property 
of the renewal requirements in Section II-E. Further, for type-1 
or type-2 renewals, the second moment of $T$ is bounded 
by the corresponding second moment of a geometric random 
variable (with success probability $\phi$), so that for all $\Theta(t_g)$: 

$$ E\{T^2 \mid \Theta(t_g)\} \le (2 - \phi)/\phi^2 $$

For type-3 renewals the second moment is bounded by that of 
a sum of $b$ independent geometric random variables: 

$$ E\{T^2 \mid \Theta(t_g)\} \le b(1 - \phi)/\phi^2 + b^2/\phi^2 $$

Proof: (Lemma 1) The proof follows by squaring the queue 
update equations (1) and (15) and using a multi-slot drift 
analysis. See Appendix A for the full proof. □ 
Let $V \geq 0$ be a non-negative parameter that we shall use to affect proximity to the optimal solution (with a tradeoff in convergence times and average queue congestion, as shown below). For pure feasibility problems, we set $V = 0$. As in [1] [18] for the case of single-slot problems, our strategy in this variable slot scenario is, upon every renewal event, to take control actions that minimize the following "drift-plus-penalty" expression:

$$D(\Theta(t_g)) + V\,\mathbb{E}\left\{\sum_{\tau=0}^{T-1} x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\right\} \tag{20}$$

where $T$ is the random time until the next renewal event.

Given the queue backlogs $\Theta(t_g)$ at the start of the renewal
time $t_g$, the expression (20) represents a sum of random drift
and penalty terms (which depend on control actions) over the 
course of a renewal interval. Hence, controlling the system to 
minimize this sum amounts to solving a weighted stochastic 
shortest path problem over the renewal interval (see [2] for a 
treatment of the theory of stochastic shortest path problems). 
This generalizes the well known max-weight policies of [1]
[18] [19] [24]. Indeed, in [1] [18] [19] [24] there is no z(t) 
state and so "renewals" occur every slot and the shortest 
path problem reduces to a simple greedy control action that 
minimizes a weighted drift-plus-penalty term over one slot. 
In this generalization, the queue backlogs still act as weights, 
but the solution of the stochastic shortest path problem is not 
greedy and requires consideration of how an action at time $t$
affects the Markov state $z(\tau)$ in future slots $\tau > t$.
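To make the single-slot special case concrete, the following sketch shows the greedy rule that the shortest path problem collapses to when there is no $z(t)$ state. It is an illustration only: the function names and the `actions` dictionary layout are assumptions, and each action is assumed to report the values of $d_n$, $x_m$ and $x_0$ it would yield under the currently observed random event.

```python
def one_slot_cost(Q, Y, x_av, d, x, x0, V):
    """Weighted drift-plus-penalty cost of a single action on a single slot."""
    drift_q = -sum(q * dn for q, dn in zip(Q, d))
    drift_y = -sum(y * (xa - xm) for y, xa, xm in zip(Y, x_av, x))
    return drift_q + drift_y + V * x0


def greedy_action(Q, Y, x_av, actions, V):
    """Return the label of the action with the smallest one-slot cost.

    `actions` maps a label to a tuple (d, x, x0) of per-queue d_n values,
    per-constraint penalties x_m, and the objective penalty x_0.
    """
    return min(actions, key=lambda a: one_slot_cost(Q, Y, x_av, *actions[a], V))


# Hypothetical two-queue example with no penalty constraints (Y is empty).
actions = {
    "serve_0": ((1, 0), (), 10.0),  # serves one packet from queue 0 at high penalty
    "serve_1": ((0, 3), (), 0.0),   # serves three packets from queue 1 at no penalty
}
print(greedy_action([5, 1], [], [], actions, V=0))  # backlog weights only
print(greedy_action([5, 1], [], [], actions, V=1))  # penalty now matters
```

With $V = 0$ the rule is a pure max-weight choice; raising $V$ shifts the decision toward low-penalty actions, which is the tradeoff the $V$ parameter controls.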



The constant $B$ in the drift bound (17) is a finite constant defined:

$$B \triangleq \frac{\sigma^2}{2} \sup_{\Theta(t_g)} \mathbb{E}\left\{T^2 \mid \Theta(t_g)\right\}$$

5 Note that proper notation for the drift should be $\Delta_T(\Theta(t_g), t_g)$, as the drift may result from a non-stationary policy and hence can depend on the starting time $t_g$, although we use the simpler notation $\Delta_T(\Theta(t_g))$ as a formal representation of the right hand side of (16).

B. Feasibility Problems (V = 0)

Suppose that we have a pure feasibility problem. Define $V = 0$, and define renewals according to any valid definition (such as one of the definitions given earlier). Suppose that there are constants $C \geq 0$ and $\delta \geq 0$ such that upon every renewal time $t_g$, we observe the queue backlogs $\Theta(t_g)$ and make control decisions over the course of the renewal interval that satisfy:

$$D(\Theta(t_g)) \leq D^{ssp}(\Theta(t_g)) + C + \delta \sum_{n\in\mathcal{N}} Q_n(t_g) + \delta \sum_{m\in\mathcal{M}} Y_m(t_g) \tag{21}$$

where $D^{ssp}(\Theta(t_g))$ denotes the value of the expression (20) under the optimal solution to the stochastic shortest path problem (which would take place over a random renewal duration $T^{ssp}$ that depends on the random events that would occur under this solution), and where $D(\Theta(t_g))$ denotes the corresponding value of (20) under the actual control actions taken (with a random renewal duration $T$ that depends on the actual random events that occur). Note that if the exact stochastic shortest path solution is implemented every renewal interval, we have $C = 0$ and $\delta = 0$.

Theorem 1: (Performance for Feasibility Problems) Suppose Assumption 1 is satisfied for a given $\epsilon > 0$. If there are constants $C \geq 0$ and $\delta \geq 0$ such that (21) is satisfied for every renewal interval, and if $\delta$ is small enough so that:

$$\epsilon\,\mathbb{E}\{T^*\} > \delta \tag{22}$$

where $\mathbb{E}\{T^*\}$ is the expected renewal duration under the policy $I^*(t)$ from Assumption 1, then all queues $Q_n(t)$ and $Y_m(t)$ are strongly stable, in the sense that $\overline{Q}_n < \infty$ and $\overline{Y}_m < \infty$ for all $n \in \mathcal{N}$ and $m \in \mathcal{M}$. Consequently, all time average feasibility constraints are satisfied. Furthermore, define $t_g$ as the timeslot of the $g$th renewal (for $g \in \{0, 1, 2, \ldots\}$ and $t_0 = 0$). Then the time average expectation of queue backlogs over the first $G$ renewal intervals satisfies (for any integer $G > 0$):

$$\frac{1}{G}\sum_{g=0}^{G-1}\left[\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\} + \sum_{m\in\mathcal{M}} \mathbb{E}\{Y_m(t_g)\}\right] \leq \frac{B + C}{\epsilon\,\mathbb{E}\{T^*\} - \delta} \tag{23}$$

where we recall that for type-1 renewals $\mathbb{E}\{T^*\}$ satisfies $1 \leq \mathbb{E}\{T^*\} \leq 1/\phi$, and for type-2 renewals $\mathbb{E}\{T^*\} = 1/\phi$.

Proof: We first prove (23). From (17) and (21) we have for any renewal time $t_g$:

$$\Delta_T(\Theta(t_g)) \leq B + C + D^{ssp}(\Theta(t_g)) + \delta \sum_{n\in\mathcal{N}} Q_n(t_g) + \delta \sum_{m\in\mathcal{M}} Y_m(t_g)$$

However, by definition, the stochastic shortest path policy yields a value of $D^{ssp}(\Theta(t_g))$ that is less than or equal to the corresponding value under any other policy, and hence:

$$D^{ssp}(\Theta(t_g)) \leq D^*(\Theta(t_g))$$

where $D^*(\Theta(t_g))$ represents the value under any other policy that could be implemented over a renewal interval that starts with queue backlogs $\Theta(t_g)$ (where the renewal interval for this other policy may have a different duration than that of the stochastic shortest path policy). Thus:

$$\begin{aligned}
\Delta_T(\Theta(t_g)) &\leq B + C + D^*(\Theta(t_g)) + \delta \sum_{n\in\mathcal{N}} Q_n(t_g) + \delta \sum_{m\in\mathcal{M}} Y_m(t_g) \\
&= B + C - \sum_{n\in\mathcal{N}} Q_n(t_g)\,\mathbb{E}\left\{-\delta + \sum_{\tau=0}^{T^*-1} d_n^*(t_g+\tau)\right\} \\
&\quad - \sum_{m\in\mathcal{M}} Y_m(t_g)\,\mathbb{E}\left\{-\delta + T^* x_m^{av} - \sum_{\tau=0}^{T^*-1} x_m^*(t_g+\tau)\right\}
\end{aligned} \tag{24}$$



where the final equality holds by definition of $D^*(\Theta(t_g))$ in (18). Now consider the $(z, \Omega)$-only policy $I^*(t)$ of Assumption 1, which satisfies inequalities (10)-(11) for some value $\epsilon > 0$, and which also makes decisions independent of $\Theta(t_g)$. Plugging (10)-(11) into (24) yields:

$$\Delta_T(\Theta(t_g)) \leq B + C - \left(\epsilon\,\mathbb{E}\{T^*\} - \delta\right) \sum_{n\in\mathcal{N}} Q_n(t_g) - \left(\epsilon\,\mathbb{E}\{T^*\} - \delta\right) \sum_{m\in\mathcal{M}} Y_m(t_g) \tag{25}$$

The above holds for any $g \in \{0, 1, 2, \ldots\}$. Taking expectations of both sides of (25) and using the definition of $\Delta_T(\Theta(t_g))$ given in (16) yields:

$$\mathbb{E}\{L(\Theta(t_{g+1}))\} - \mathbb{E}\{L(\Theta(t_g))\} \leq B + C - \left(\epsilon\,\mathbb{E}\{T^*\} - \delta\right)\left[\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\} + \sum_{m\in\mathcal{M}} \mathbb{E}\{Y_m(t_g)\}\right]$$

Summing the above inequality over $g \in \{0, 1, \ldots, G-1\}$ and using the fact that all queues are initially empty (so that $L(\Theta(0)) = 0$), and dividing by $G$ yields:

$$\frac{\mathbb{E}\{L(\Theta(t_G))\}}{G} \leq (B + C) - \frac{\epsilon\,\mathbb{E}\{T^*\} - \delta}{G} \sum_{g=0}^{G-1}\left[\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\} + \sum_{m\in\mathcal{M}} \mathbb{E}\{Y_m(t_g)\}\right]$$

Rearranging terms and using non-negativity of $L(\Theta(t_G))$ together with the fact that $\epsilon\,\mathbb{E}\{T^*\} - \delta > 0$ (by the assumption in (22)) yields the result of (23).

The fact that the queues $Q_n(t)$ and $Y_m(t)$ are strongly stable follows as a simple consequence of (23) together with the facts that (i) queue backlog growth is deterministically bounded every slot, and (ii) first and second moments of renewal times are bounded. This is formally shown in Appendix B for completeness. □

Note that type-1 renewals are the best for feasibility problems, as they have the smallest renewal duration and thus have smaller values of the $B$ constant (and hence smaller average sizes for queues $Q_n(t)$ and $Y_m(t)$). Indeed, the definition of $B$ and the bound on $\mathbb{E}\{T^2 \mid \Theta(t_g)\}$ given in Lemma 1 imply the following for type-1 renewals:

$$B \leq \frac{(2-\phi)\sigma^2}{2\phi^2} \tag{26}$$

This is compared to the type-3 renewals which consist of $b > 1$ type-1 renewals and yield:

$$B \leq \frac{\left(b(1-\phi) + b^2\right)\sigma^2}{2\phi^2} \tag{27}$$

However, type-3 renewals are useful in cases when the shortest path problem is solved using online approximation techniques based on forward simulation, such as the Q-learning algorithms in [2], which require a longer time to converge to a solution that satisfies the approximation bound (21). Using $b > 1$ in this way can be viewed as a kind of 2-timescale approach, where $b$ is proportional to the timescale required for accuracy of the stochastic shortest path approximation. One difficulty with this approach is that the value $b$ required to obtain an accurate approximation may be geometric in $K$, which then creates an exponential bound on $B$ in (27). An alternative (single-timescale) approximation technique that does not require $b > 1$ and that preserves the polynomial bound of (26) associated with type-1 renewals (i.e., $b = 1$) is provided in Section IV.
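The growth of the $B$ bound with the number $b$ of concatenated renewals, and its effect on the backlog bound of Theorem 1, can be tabulated with a short script. This is a sketch under assumptions: the function names are hypothetical and the numeric parameter values are arbitrary illustrations, not values from the paper.

```python
def b_bound_type1(phi, sigma2):
    """Bound (26): B <= (2 - phi) * sigma^2 / (2 * phi^2)."""
    return (2 - phi) * sigma2 / (2 * phi**2)


def b_bound_type3(phi, sigma2, b):
    """Bound (27): B <= (b * (1 - phi) + b^2) * sigma^2 / (2 * phi^2)."""
    return (b * (1 - phi) + b**2) * sigma2 / (2 * phi**2)


def backlog_bound(B, C, eps, mean_T_star, delta):
    """Right-hand side of (23); requires eps * E{T*} > delta as in (22)."""
    margin = eps * mean_T_star - delta
    if margin <= 0:
        raise ValueError("condition (22) is violated")
    return (B + C) / margin


phi, sigma2 = 0.5, 2.0  # illustrative values only
for b in (1, 2, 4, 8):
    B = b_bound_type3(phi, sigma2, b)
    print(b, B, backlog_bound(B, C=0.0, eps=0.1, mean_T_star=2.0, delta=0.0))
```

Setting $b = 1$ in (27) recovers (26), and the quadratic growth in $b$ is what makes a geometric $b$ costly.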



C. Optimization Problems (V > 0) 

Consider now the optimization problem (2)-(5), so that we desire to minimize the time average of the penalty $x_0(t)$, and the $V$ parameter in the stochastic shortest path problem (20) is positive. Further suppose that our renewals are defined as type-2 renewals, so that renewal events are only at forced renewal times, which are i.i.d. Bernoulli with probability $\phi$. Suppose that there are constants $C \geq 0$ and $\delta \geq 0$ such that on every renewal interval we observe the queue states $\Theta(t_g)$ and take actions that satisfy the following approximation:

$$\begin{aligned}
D(\Theta(t_g)) + V\,\mathbb{E}\left\{\sum_{\tau=0}^{T-1} x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\right\} \leq{} & D^{ssp}(\Theta(t_g)) + V\,\mathbb{E}\left\{\sum_{\tau=0}^{T-1} x_0^{ssp}(t_g+\tau) \,\Big|\, \Theta(t_g)\right\} \\
& + C + \delta \sum_{n\in\mathcal{N}} Q_n(t_g) + \delta \sum_{m\in\mathcal{M}} Y_m(t_g) + V\delta
\end{aligned} \tag{28}$$

where $x_0(t)$ represents the penalty that is incurred by the implemented policy, $x_0^{ssp}(t)$ is the penalty that would be incurred under the stochastic shortest path solution to (20), and $T$ is the renewal frame size (which is unaffected by control decisions for type-2 renewals and satisfies $\mathbb{E}\{T\} = 1/\phi$). Note that if the exact stochastic shortest path solution to (20) is used every renewal interval, we have $C = \delta = 0$.

Theorem 2: (Performance for Optimization Problems) Suppose we use type-2 renewals (where all renewals are forced renewals and occur i.i.d. with probability $\phi$), and suppose Assumptions 1 and 2 hold for a given $\epsilon > 0$. Fix a parameter $V > 0$. If there are constants $C \geq 0$ and $\delta \geq 0$ such that (28) is satisfied for every renewal interval, and if $\delta$ is small enough so that $\epsilon > \phi\delta$, then $\overline{Q}_n < \infty$ and $\overline{Y}_m < \infty$ for all $n \in \mathcal{N}$ and $m \in \mathcal{M}$ (and consequently the feasibility constraints are satisfied). Furthermore, for renewal times $t_g$ (for $g \in \{0, 1, 2, \ldots\}$) and for any positive integer $G$ we have:

$$\frac{1}{G}\sum_{g=0}^{G-1}\left[\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\} + \sum_{m\in\mathcal{M}} \mathbb{E}\{Y_m(t_g)\}\right] \leq \frac{(B + C)\phi + V\left(\phi\delta + x_0^{max} - x_0^{min}\right)}{\epsilon - \phi\delta} \tag{29}$$

Finally, the time average penalty satisfies:

$$\limsup_{t\rightarrow\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{x_0(\tau)\} \leq x_0^{opt} + \phi\delta\left[1 + \left(x_0^{max} - x_0^{opt}\right)/\epsilon\right] + \frac{(B + C)\phi}{V} \tag{30}$$

and the right-hand side is also a bound on the average penalty over $G$ renewal intervals divided by the average duration of $G$ renewal intervals, for any positive $G$ (see Appendix C).

Note from (30) and (29) that the time average of $x_0(t)$ can be made arbitrarily close to (or below) $x_0^{opt} + \phi\delta[1 + (x_0^{max} - x_0^{opt})/\epsilon]$ as $V$ is increased, with a tradeoff in average queue size that is linear in $V$. The value $\delta$ determines how close this performance is to the optimal value $x_0^{opt}$. In the case $\delta = 0$ (which holds, for example, if our approximation to the stochastic shortest path problem differs from the optimal solution only by a constant $C$ that is independent of queue length), the $V$ parameter affects a $[O(1/V); O(V)]$ performance-delay tradeoff, as in [1], so that distance to optimality is $O(1/V)$ and hence can be made arbitrarily small, at the expense of an increase in the average backlog of the queues that is linear in $V$. This average backlog of the queues $Q_n(t)$ directly affects their average delay (via Little's Theorem), while the average backlog of the virtual queues $Y_m(t)$ affects the average convergence time required to achieve the performance guarantees (see also, for example, [25]).

Proof: See Appendix C. □

Only type-2 renewals are considered in Theorem 2 because $\mathbb{E}\{T\} = \mathbb{E}\{T^*\} = 1/\phi$ for type-2 renewal intervals, a property which is needed for (30).
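The $[O(1/V); O(V)]$ tradeoff can be illustrated by evaluating the two bounds of Theorem 2 for a few values of $V$. The sketch below assumes the bounds have the form $[(B+C)\phi + V(\phi\delta + x_0^{max} - x_0^{min})]/(\epsilon - \phi\delta)$ for the backlog and $x_0^{opt} + \phi\delta[1 + (x_0^{max} - x_0^{opt})/\epsilon] + (B+C)\phi/V$ for the penalty; the function name and all numeric values are hypothetical.

```python
def theorem2_bounds(V, B, C, phi, delta, eps, x_max, x_min, x_opt):
    """Return (backlog_bound, penalty_bound) for a given V > 0."""
    if eps <= phi * delta:
        raise ValueError("need eps > phi * delta")
    backlog = ((B + C) * phi + V * (phi * delta + x_max - x_min)) / (eps - phi * delta)
    penalty = x_opt + phi * delta * (1 + (x_max - x_opt) / eps) + (B + C) * phi / V
    return backlog, penalty


# Illustrative parameters with an exact shortest path solution (C = delta = 0).
params = dict(B=10.0, C=0.0, phi=0.5, delta=0.0, eps=0.1,
              x_max=1.0, x_min=0.0, x_opt=0.2)
for V in (10.0, 100.0, 1000.0):
    backlog, penalty = theorem2_bounds(V, **params)
    print(V, backlog, penalty)
```

In this example each tenfold increase in $V$ shrinks the penalty gap to $x_0^{opt}$ by a factor of ten while the backlog bound grows roughly tenfold.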

IV. Solving the Stochastic Shortest Path Problem 

Consider now the stochastic shortest path problem given by expression (20). Here we describe its solution, using either a type-1 or type-2 renewal definition (so that renewals always occur at forced renewal events, and have mean duration at most $1/\phi$). Without loss of generality, assume we start at time 0 and have (possibly non-zero) backlogs $\Theta = \Theta(0)$. Let $T$ be the renewal interval size. For every step $\tau \in \{0, \ldots, T-1\}$, define $c_\Theta(I(\tau), \Omega(\tau), z(\tau))$ as the incurred cost assuming that the queue state at the beginning of the renewal is $\Theta(0)$:

$$\begin{aligned}
c_\Theta(I(\tau),\Omega(\tau),z(\tau)) \triangleq{} & -\sum_{n\in\mathcal{N}} Q_n(0)\, d_n(I(\tau),\Omega(\tau),z(\tau)) \\
& - \sum_{m\in\mathcal{M}} Y_m(0)\left[x_m^{av} - x_m(I(\tau),\Omega(\tau),z(\tau))\right] \\
& + V x_0(I(\tau),\Omega(\tau),z(\tau))
\end{aligned} \tag{31}$$

Let $I^{ssp}(\tau)$ denote the optimal control action on slot $\tau$ for solving the stochastic shortest path problem, given that the controller first observes $\Omega(\tau)$ and $z(\tau)$. Define $\mathcal{Z}_r \triangleq \mathcal{Z} \cup \{r\}$, where we have added a new state "$r$" to represent the renewal state, which is the termination state of the stochastic shortest path problem. Appropriately adjust the probability transition matrix $P = (P_{zy}(I, \Omega))$ to account for this new state [26] [2]. For example, for type-1 renewals, all transition probabilities are the same as before with the exception that transitions to state 0 are replaced with transitions to $r$. Define $J = (J_z)|_{z\in\mathcal{Z}_r}$ as a vector of optimal costs, where $J_z$ is the minimum expected sum cost to the renewal state given that we start in state $z$, and $J_r = 0$. By basic dynamic programming theory [26] [2], the optimal control action on each slot $\tau$ (given $\Omega(\tau)$ and $z(\tau)$) is:

$$I^{ssp}(\tau) = \arg\min_{I \in \mathcal{I}_{\Omega(\tau),z(\tau)}} \left[c_\Theta(I,\Omega(\tau),z(\tau)) + \sum_{y\in\mathcal{Z}_r} P_{z(\tau),y}(I,\Omega(\tau))\, J_y\right] \tag{32}$$

This policy is easily implemented provided that the $J_z$ values are known. It is well known that the $J$ vector satisfies the following vector dynamic programming equation (see footnote 6):

$$J = \mathbb{E}\left\{\min_{I \in \mathcal{I}_{\Omega(t),z}} \left[c_\Theta(I,\Omega(t)) + P(I,\Omega(t))\,J\right]\right\} \tag{33}$$

where we have used an entry-wise min (possibly with different $I$ values being used for minimizing each entry $z \in \mathcal{Z}$). Thus, the notation $I \in \mathcal{I}_{\Omega(t),z}$ emphasizes that for a given $z \in \mathcal{Z}$, the control action $I$ is chosen from the set $\mathcal{I}_{\Omega(t),z}$. Further, $c_\Theta(I,\Omega(t))$ is defined as a vector with entries $c_{z,\Theta}(I,\Omega(t)) = c_\Theta(I,\Omega(t),z)$, and $P(I,\Omega(t)) = (P_{zy}(I,\Omega(t)))$ is the probability transition matrix under $\Omega(t)$ and control action $I$. The expectation in (33) is over the distribution of the i.i.d. process $\Omega(t)$. Because $\Omega(t)$ has the structure $\Omega(t) = [\omega(t); \phi(t)]$, where $\omega(t)$ is the random outcome for slot $t$ and $\phi(t)$ is an independent Bernoulli process that has forced renewals with probability $\phi$, we can re-write the above vector equation as:

$$J = \phi\,\mathbb{E}\left\{\min_{I \in \mathcal{I}_{[\omega(t),1],z}} \left[c_\Theta^{(1)}(I,\omega(t))\right]\right\} + (1-\phi)\,\mathbb{E}\left\{\min_{I \in \mathcal{I}_{[\omega(t),0],z}} \left[c_\Theta^{(0)}(I,\omega(t)) + P^{(0)}(I,\omega(t))\,J\right]\right\} \tag{34}$$

where:

$$c_\Theta^{(1)}(I,\omega(t)) \triangleq c_\Theta(I,[\omega(t),1]), \qquad c_\Theta^{(0)}(I,\omega(t)) \triangleq c_\Theta(I,[\omega(t),0]), \qquad P^{(0)}(I,\omega(t)) \triangleq P(I,[\omega(t),0])$$

We assume that the probability transition matrix $P^{(0)}(I,\omega(t))$ is known (recall that this is indeed a known 0/1 matrix in the case of the system with delay-constrained and delay-unconstrained users of Section II-C). We next show how to compute an approximation of $J$ based on random samples of $\omega(t)$ and using a classic Robbins-Monro iteration.
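When the distribution of $\omega(t)$ happens to be known, the fixed point of the dynamic programming equation can be computed by plain value iteration, which gives a useful baseline for the sampled method. The sketch below is a toy illustration under several assumptions that are not from the paper: states are indexed $0, \ldots, |\mathcal{Z}|-1$, renewals occur only through forced renewals, transitions are deterministic given $(z, \omega, I)$, and all function names are hypothetical.

```python
def apply_psi(J, phi, omega_probs, actions, cost_forced, cost_free, next_state):
    """One application of the expected-value mapping to the vector J.

    cost_forced(z, w, a): cost on a slot where a forced renewal occurs.
    cost_free(z, w, a):   cost on a slot without a forced renewal.
    next_state(z, w, a):  successor state (a 0/1 transition matrix is assumed).
    omega_probs:          dict mapping each outcome w to its probability.
    """
    new_J = []
    for z in range(len(J)):
        forced = sum(p * min(cost_forced(z, w, a) for a in actions)
                     for w, p in omega_probs.items())
        free = sum(p * min(cost_free(z, w, a) + J[next_state(z, w, a)] for a in actions)
                   for w, p in omega_probs.items())
        new_J.append(phi * forced + (1 - phi) * free)
    return new_J


def solve_by_value_iteration(num_states, phi, omega_probs, actions,
                             cost_forced, cost_free, next_state, iterations=300):
    """Iterate from J = 0; the mapping contracts with modulus (1 - phi) in the max norm."""
    J = [0.0] * num_states
    for _ in range(iterations):
        J = apply_psi(J, phi, omega_probs, actions, cost_forced, cost_free, next_state)
    return J


# Toy instance: two states, two equally likely outcomes, unit cost on every slot.
unit_cost = lambda z, w, a: 1.0
flip_state = lambda z, w, a: (z + a) % 2
J = solve_by_value_iteration(2, 0.25, {0: 0.5, 1: 0.5}, [0, 1],
                             unit_cost, unit_cost, flip_state)
print(J)
```

With unit cost per slot the expected cost to renewal equals the expected frame length $1/\phi$, so the toy instance should return entries close to 4.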

6 One can also derive (33) by defining a value function $H(z, \Omega)$, writing the Bellman equation in terms of $H(z(t+1), \Omega(t+1))$, taking an expectation with respect to the i.i.d. $\Omega(t)$, $\Omega(t+1)$, and defining $J(z) \triangleq \mathbb{E}_{\Omega(t)}\{H(z, \Omega(t))\}$.

A. Estimation Through Random i.i.d. Samples

Suppose we have an infinite sequence of random variables arranged in batches with batch size $L$, with $\omega_{bi}$ denoting the $i$th sample of batch $b$. All random variables are i.i.d. with probability distribution the same as $\omega(t)$, and all are independent of the queue state $\Theta$ that is used for this stochastic shortest path problem. Consider the following two mappings $\Psi$ and $\tilde{\Psi}$ from a $J$ vector to another $J$ vector, where the second is implemented with respect to a particular batch $b$:

$$\Psi J \triangleq \phi\,\mathbb{E}\left\{\min_{I \in \mathcal{I}_{[\omega(t),1],z}} \left[c_\Theta^{(1)}(I,\omega(t))\right]\right\} + (1-\phi)\,\mathbb{E}\left\{\min_{I \in \mathcal{I}_{[\omega(t),0],z}} \left[c_\Theta^{(0)}(I,\omega(t)) + P^{(0)}(I,\omega(t))\,J\right]\right\} \tag{35}$$

$$\tilde{\Psi} J \triangleq \phi\,\frac{1}{L}\sum_{i=1}^{L} \min_{I \in \mathcal{I}_{[\omega_{bi},1],z}} \left[c_\Theta^{(1)}(I,\omega_{bi})\right] + (1-\phi)\,\frac{1}{L}\sum_{i=1}^{L} \min_{I \in \mathcal{I}_{[\omega_{bi},0],z}} \left[c_\Theta^{(0)}(I,\omega_{bi}) + P^{(0)}(I,\omega_{bi})\,J\right] \tag{36}$$

where the min is entrywise over each vector entry. The expectation in (35) is implicitly conditioned on a given $\Theta$ vector, and is with respect to the random $\omega(t)$ event that is independent of $\Theta$. The mapping $\Psi$ cannot be implemented without knowledge of the distribution of $\omega(t)$ (so that the expectation can be computed), whereas the mapping $\tilde{\Psi}$ can be implemented as a "simulation" over the $L$ random samples $\omega_{bi}$ (assuming such samples can be generated or obtained). Note however that the expected value of $\tilde{\Psi} J$ is exactly equal to $\Psi J$. Thus, given an initial vector $J_b$ for use for step $b$ (with some initial guess for $J_0$, such as $J_0 = 0$), we can write $\tilde{\Psi} J_b = \Psi J_b + \eta_b$, where $\eta_b$ is a zero-mean vector random variable. Specifically, the vector $\eta_b$ satisfies:

$$\mathbb{E}\{\eta_b \mid J_b\} = 0$$

Thus, while the vector $\eta_b$ is not independent of $J_b$, each entry is uncorrelated with any deterministic function of $J_b$. That is, for each entry $i$ and any deterministic function $f(\cdot)$ we have via iterated expectations:

$$\mathbb{E}\{\eta_b[i] f(J_b)\} = \mathbb{E}\{f(J_b)\,\mathbb{E}\{\eta_b[i] \mid J_b\}\} = 0 \tag{37}$$

For $b \in \{0, 1, 2, \ldots\}$ we have the iteration:

$$J_{b+1} = \frac{1}{b+1}\tilde{\Psi} J_b + \frac{b}{b+1} J_b \tag{38}$$

This iteration is a classic Robbins-Monro stochastic approximation algorithm. It can be shown that the $J_b$ vector remains deterministically bounded for all $b$ (see Lemma 2 below), and that $\Psi$ and $\tilde{\Psi}$ satisfy the requirements of Proposition 4.6 in Section 4.3.4 of [2]. Thus the above iteration is in the standard form for stochastic approximation theory, and ensures that:

$$\lim_{b\rightarrow\infty} J_b = J^* \quad \text{with prob. 1}$$

where $J^*$ is the cost vector associated with the optimal stochastic shortest path problem, that is, it is the solution to (34) and thus satisfies $J^* = \Psi J^*$. This holds for any batch size $L$ (including the simplest case $L = 1$), although taking larger batches may improve overall convergence as the variance of the per-batch estimation is lower.
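The batch-averaged Robbins-Monro iteration can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's implementation: states are integer-indexed, transitions are deterministic given $(z, \omega, I)$, renewals are forced only, the step size is $1/(b+1)$, and the cost function and names are hypothetical.

```python
import random


def apply_psi_tilde(J, phi, batch, actions, cost_forced, cost_free, next_state):
    """Sample-average mapping over one batch of omega samples."""
    L = len(batch)
    new_J = []
    for z in range(len(J)):
        forced = sum(min(cost_forced(z, w, a) for a in actions) for w in batch) / L
        free = sum(min(cost_free(z, w, a) + J[next_state(z, w, a)] for a in actions)
                   for w in batch) / L
        new_J.append(phi * forced + (1 - phi) * free)
    return new_J


def robbins_monro(num_states, phi, draw_omega, actions, cost_forced, cost_free,
                  next_state, batches, batch_size):
    """Run J_{b+1} = (1/(b+1)) * PsiTilde(J_b) + (b/(b+1)) * J_b starting from J_0 = 0."""
    J = [0.0] * num_states
    for b in range(batches):
        batch = [draw_omega() for _ in range(batch_size)]
        target = apply_psi_tilde(J, phi, batch, actions, cost_forced, cost_free, next_state)
        step = 1.0 / (b + 1)
        J = [step * t + (1 - step) * j for t, j in zip(target, J)]
    return J


# Toy instance: one state, omega uniform on {0, 1}, two actions.
rng = random.Random(0)
cost = lambda z, w, a: abs(w - a) + 0.5 * a
J_est = robbins_monro(1, 0.5, lambda: 1 if rng.random() < 0.5 else 0, [0, 1],
                      cost, cost, lambda z, w, a: 0, batches=5000, batch_size=4)
print(J_est)
```

In this toy instance the expected minimum one-slot cost is 0.25, so the fixed point is $J^* = 0.25/\phi = 0.5$ and the estimate should settle near that value.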

However, because our estimates do not need to converge to the exact value of $J^*$, we modify the iteration (38) as follows:

$$J_{b+1} = \gamma\,\tilde{\Psi} J_b + (1-\gamma) J_b \tag{39}$$

where $\gamma$ is a value such that $0 < \gamma < 1$, chosen to be suitably small to provide an accurate approximation, as specified below. We first define a norm for random vectors.

Definition 2: Let $X = (X_j)$ be a random vector. We define the entrywise root expected square norm (or e-norm) $\|X\|_e$ and its deterministic version $\|X\|_d$ as follows:

$$\|X\|_e \triangleq \max_j \sqrt{\mathbb{E}\{X_j^2\}}, \qquad \|X\|_d \triangleq \max_j |X_j|$$

It is easy to verify that for any random vector $X$ we have $\|X\|_e \geq 0$, $\|-X\|_e = \|X\|_e$, $\|aX\|_e = a\|X\|_e$ for any non-negative scalar $a$, and the triangle inequality $\|X + Y\|_e \leq \|X\|_e + \|Y\|_e$ holds for all random vectors $X$ and $Y$ with the same dimension. If $X$ is a deterministic constant vector, then $\|X\|_d = \|X\|_e$.

Now consider the iteration (39). We want to show that the noise vectors $\{\eta_b\}_{b=0}^{\infty}$ are bounded for all $b$. To this end, let $c_{max}$ denote the maximum absolute value of any entry of the $c_\Theta^{(1)}(I, \omega(t))$ and $c_\Theta^{(0)}(I, \omega(t))$ vectors, considering all possible $\omega(t)$ and $I$. Note that this value is finite due to the finiteness of the penalty functions. The size of $c_{max}$ grows linearly in $V$ and in the size of queue backlogs $\Theta$.

Lemma 2: Define $J_{max} \triangleq c_{max}/\phi$. If $\|J_0\|_e \leq J_{max}$, then:

(a) For all $b \in \{0, 1, 2, \ldots\}$ we have:

$$\|J_b\|_d \leq J_{max}$$

(b) There are finite constants $\eta_{min}$ and $\eta_{max}$ such that $\eta_{min} \leq \eta_b[i] \leq \eta_{max}$ for all iterations $b$ and all entries $i$. Further, for all $b$, if batches of size $L \geq 1$ are used, then:

$$\|\eta_b\|_e^2 \leq \frac{|\eta_{min}\eta_{max}|}{L} \leq \frac{4\left(c_{max} + (1-\phi)J_{max}\right)^2}{L}$$

Proof: See Appendix E. □
Lemma 3: Let $J_b$ be the $b$th iteration of (39), starting with some initial vector $J_0$ with $\|J_0\|_e \leq J_{max}$. Assume that for all $b$ we have $\|\eta_b\|_e^2 \leq \sigma^2$ for some finite constant $\sigma^2$ (as in Lemma 2 with $\sigma^2 = |\eta_{min}\eta_{max}|/L$). Let $J^*$ be the optimal solution to (34), satisfying $\Psi J^* = J^*$. Then:

(a) Every one-step iteration satisfies (for integers $b \geq 0$):

$$\|J_{b+1} - J^*\|_e^2 \leq (1-\phi\gamma)^2 \|J_b - J^*\|_e^2 + \gamma^2 \|\eta_b\|_e^2$$

(b) After $b$ iterations we have:

$$\|J_b - J^*\|_e^2 \leq (1-\phi\gamma)^{2b} \|J_0 - J^*\|_e^2 + \frac{\gamma\sigma^2\left(1 - (1-\phi\gamma)^{2b}\right)}{\phi(2-\phi\gamma)}$$

(c) In the limit, we have:

$$\lim_{b\rightarrow\infty} \|J_b - J^*\|_e^2 \leq \frac{\gamma\sigma^2}{\phi(2-\phi\gamma)}$$

Proof: See Appendix E. □
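Part (b) of Lemma 3 is what one gets by unrolling the one-step recursion of part (a) with the noise term replaced by its bound $\sigma^2$, using $1 - (1-\phi\gamma)^2 = \phi\gamma(2-\phi\gamma)$. The short check below confirms this numerically; the function names and parameter values are illustrative assumptions.

```python
def lemma3_closed_form(b, gamma, phi, sigma2, err0_sq):
    """Closed-form bound of part (b) on the squared e-norm error after b iterations."""
    rho_2b = (1 - phi * gamma) ** (2 * b)
    return rho_2b * err0_sq + gamma * sigma2 * (1 - rho_2b) / (phi * (2 - phi * gamma))


def lemma3_unrolled(b, gamma, phi, sigma2, err0_sq):
    """Apply the one-step bound of part (a) b times with the worst-case noise sigma2."""
    err = err0_sq
    for _ in range(b):
        err = (1 - phi * gamma) ** 2 * err + gamma**2 * sigma2
    return err


def lemma3_limit(gamma, phi, sigma2):
    """Limiting bound of part (c)."""
    return gamma * sigma2 / (phi * (2 - phi * gamma))


gamma, phi, sigma2, err0_sq = 0.1, 0.5, 4.0, 100.0
for b in (0, 10, 50, 500):
    print(b, lemma3_closed_form(b, gamma, phi, sigma2, err0_sq),
          lemma3_unrolled(b, gamma, phi, sigma2, err0_sq))
print(lemma3_limit(gamma, phi, sigma2))
```

The two columns agree up to floating point error, and both approach the limiting value of part (c) as $b$ grows.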
Part (c) shows that the limiting deviation from the desired $J^*$ vector can be made as small as desired by choosing a suitably small value of $\gamma$ (and a suitably large value of $b$). We now show that an implementation that chooses $I(t)$ over a frame according to (32), using the $J_b$ estimate instead of the optimal $J^*$ vector, results in an approximation to the stochastic shortest path problem that deviates by an amount that depends on $\|J_b - J^*\|_e$.

Lemma 4: Suppose we choose $I(t)$ according to (32) over the course of a frame, using a (possibly random) vector $J$ rather than $J^*$. Define $\hat{J}(J)$ as the vector of expected sum costs over the frame (given this implementation uses $J$). Then:

$$\|\hat{J}(J) - J^*\|_e \leq \frac{2(1-\phi)\|J - J^*\|_e}{\phi}$$

Further, defining $\mathbf{1}$ as a vector of all 1 entries with size equal to the dimension of $J$, we have:

$$\mathbb{E}\{\hat{J}(J)\} \leq J^* + \mathbf{1}\,\frac{2(1-\phi)\|J - J^*\|_e}{\phi}$$

where the expectation above is with respect to the randomness of the $J$ vector.

Proof: See Appendix E. □

B. Choosing $b$ and $\gamma$

Lemma 3 can be used to compute the $C$ and $\delta$ constants in Theorems 1 and 2. For example, if we start the iterations with $J_0 = 0$, and note that $\|J^*\|_e \leq J_{max} = c_{max}/\phi$, we find from part (b) of Lemma 3 that the main error term decreases exponentially fast (in $b$), so that:

$$\|J_b - J^*\|_e^2 \leq (1-\phi\gamma)^{2b}\frac{c_{max}^2}{\phi^2} + \frac{\gamma\sigma^2\left(1 - (1-\phi\gamma)^{2b}\right)}{\phi(2-\phi\gamma)} \tag{40}$$

Now choose $b$ so that:

$$(1-\phi\gamma)^{2b}\frac{c_{max}^2}{\phi^2} - \frac{\gamma\sigma^2(1-\phi\gamma)^{2b}}{\phi(2-\phi\gamma)} \leq \frac{\gamma\sigma^2}{\phi(2-\phi\gamma)}$$

which implies from (40) that:

$$\|J_b - J^*\|_e^2 \leq \frac{2\gamma\sigma^2}{\phi(2-\phi\gamma)} \tag{41}$$

This is equivalent to choosing the integer $b$ as follows: If $c_{max}^2/\phi^2 \leq \gamma\sigma^2/(\phi(2-\phi\gamma))$ then choose $b = 0$. Else, choose $b$ such that:

$$b \geq \frac{\log\left(\dfrac{c_{max}^2(2-\phi\gamma)}{\phi\gamma\sigma^2} - 1\right)}{2\log\left(1/(1-\phi\gamma)\right)} \tag{42}$$

With this choice of $b$, combining (41) with Lemma 4 shows that the expected cost over the duration of the frame when we use the $J_b$ vector (rather than the optimal $J^*$ vector) differs by no more than a constant $\alpha$, where:

$$\alpha \triangleq \frac{2(1-\phi)\sqrt{2\gamma\sigma^2}}{\phi\sqrt{\phi(2-\phi\gamma)}} \leq \frac{2(1-\phi)\sqrt{2\gamma\sigma^2}}{\phi\sqrt{\phi(2-\phi)}}$$

Now note from (31) that $c_{max}$ grows at most linearly in $V$ and $\|\Theta\|_d$. Hence (by Lemma 2 and Lemma 3), $\eta_{min}$, $\eta_{max}$, and $\sqrt{\sigma^2}$ also grow at most linearly and so:

$$\sqrt{\sigma^2} \leq d_L \max\left[\|\Theta\|_d, V\right]$$

for some proportionality constant $d_L$ that depends on the batch size $L$ and the maximum penalties. Thus, for a given desired $\delta > 0$, choosing $\gamma$ to satisfy:

$$0 < \gamma < \min\left[1, \frac{\delta^2\phi^3(2-\phi)}{8(1-\phi)^2 d_L^2}\right]$$

ensures that:

$$\alpha \leq \max\left[\|\Theta\|_d, V\right]\delta$$

Thus, inequalities (21) and (28) can be satisfied for this $\delta$ and choosing $C = 0$. This can be seen by simply placing the approximation error on $Q_n(t_g)\delta$, $Y_m(t_g)\delta$, or $V\delta$, depending on which term has the largest value. The above bound shows that $\gamma$ must be chosen quite small, which requires a large number of iterations $b$ according to (42) for provably high accuracy of the algorithm. However, the resulting $b$ is still polynomial in the system parameters and in $1/\delta$, demonstrating that we need only a polynomial number of samples for performance to be arbitrarily close to the optimal with polynomial delay, and this is independent of the size of the state space $\mathcal{Z}$. In practice, one may not require such a small value of $\gamma$ or a large value of $b$. Further, rather than starting the iterations with $J_0 = 0$, performance can be significantly improved if we initiate the iterations on the current frame using the end value of $J_b$ from the previous frame. The intuition is that the queue backlogs do not change significantly over the course of the frame, and hence the previous calculations can be exploited.
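The choice of the iteration count can be sketched as a small helper. The code assumes the stopping rule is that the transient part of the bound (40) is no larger than the steady-state term $A = \gamma\sigma^2/(\phi(2-\phi\gamma))$, so that the total is at most $2A$; the function names and numeric values are hypothetical.

```python
import math


def steady_state_term(gamma, phi, sigma2):
    """A = gamma * sigma^2 / (phi * (2 - phi * gamma))."""
    return gamma * sigma2 / (phi * (2 - phi * gamma))


def error_bound(b, c_max, phi, gamma, sigma2):
    """Bound (40) on the squared e-norm error after b iterations from J_0 = 0."""
    rho_2b = (1 - phi * gamma) ** (2 * b)
    return rho_2b * c_max**2 / phi**2 + steady_state_term(gamma, phi, sigma2) * (1 - rho_2b)


def iterations_needed(c_max, phi, gamma, sigma2):
    """Smallest integer b for which the bound (40) is at most 2A."""
    A = steady_state_term(gamma, phi, sigma2)
    start = c_max**2 / phi**2
    if start <= A:
        return 0
    ratio = (start - A) / A
    b = math.log(ratio) / (2 * math.log(1 / (1 - phi * gamma)))
    return max(0, math.ceil(b))


c_max, phi, gamma, sigma2 = 10.0, 0.5, 0.01, 1.0
b = iterations_needed(c_max, phi, gamma, sigma2)
print(b, error_bound(b, c_max, phi, gamma, sigma2), 2 * steady_state_term(gamma, phi, sigma2))
```

Because $\log(1/(1-\phi\gamma)) \approx \phi\gamma$ for small $\gamma$, the required $b$ scales roughly like $1/(\phi\gamma)$ times a logarithmic factor, which is consistent with the remark that a small $\gamma$ demands many iterations.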

C. Complexity Discussion 

Note that computing $\tilde{\Psi} J$ in (36) involves taking a minimum over the control actions $I$ that are available for each sample $\omega_{bi}$ and state $z$. For the delay-constrained wireless example, this involves selecting one of the $K + N$ queues to serve,
an operation with complexity that is linear in $K+N$. However,
the minimization must be done for every entry of the J vector, 
the size of which is equal to the cardinality of the set Z (which 
is geometric in the number of delay-constrained queues K 
but independent of the number of delay-unconstrained queues 
N). This illustrates that we can solve problems with a very 
large number of delay-unconstrained queues, provided that the 
number of delay-constrained queues is small. 

D. Sampling From the Past and Delayed Queue Analysis 

It remains to be seen how one can obtain the required i.i.d. samples without knowing the probability distribution for $\omega(t)$. One might consider an online computation that obtains the samples by stepping forward in time. However, this requires a longer renewal interval to amortize the cost of learning the new samples, and hence creates additional congestion in the actual (delay-unconstrained) queues and in the virtual queues. In this subsection, we describe a technique that uses previous samples of the $\omega(\tau)$ values. This method maintains smaller renewal intervals and hence smaller bounds on the queues we are stabilizing.

We first obtain a collection of $W$ i.i.d. samples of $\omega(t)$. Consider a given renewal time $t_g$, and suppose that the time $t_g$ is large enough so that we can obtain $W$ samples according to the following procedure: Let $\omega_1 \triangleq \omega(t_g)$. If we have a type-2 definition of renewals, we define: $\omega_2 \triangleq \omega(t_g - 1)$, $\omega_3 \triangleq \omega(t_g - 2)$, $\ldots$, $\omega_W \triangleq \omega(t_g - W + 1)$. Because $\omega(t)$ is i.i.d. over slots (and because our type-2 renewals are chosen completely randomly), it is easy to see that $\{\omega_1, \ldots, \omega_W\}$ form an i.i.d. sequence. If we have a type-1 definition of renewals, we must be more careful in obtaining our samples, as the renewal times are not random but depend on past control decisions, which are correlated with the samples themselves. Nevertheless, we can begin finding samples at the last forced renewal event, and sample backwards in time from that point.
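The backward sampling procedure for type-2 renewals can be sketched as follows. This is an illustration only: `omega_history` is a hypothetical list in which entry `t` stores the observed outcome $\omega(t)$, and the function also returns the slot of the oldest sample, which plays the role of $t_{start}$ in the delayed queue analysis.

```python
def past_samples(omega_history, t_g, W):
    """Return ([omega(t_g), omega(t_g - 1), ..., omega(t_g - W + 1)], t_start).

    Raises ValueError if fewer than W slots have elapsed by time t_g.
    """
    t_start = t_g - W + 1
    if W < 1 or t_start < 0 or t_g >= len(omega_history):
        raise ValueError("not enough history to draw W samples at time t_g")
    samples = [omega_history[t_g - k] for k in range(W)]
    return samples, t_start


# Hypothetical history in which omega(t) simply equals the slot index t.
history = list(range(10))
samples, t_start = past_samples(history, t_g=7, W=3)
print(samples, t_start)
```

For type-1 renewals the same helper could be called with the slot of the last forced renewal in place of `t_g`, which is how the text suggests avoiding correlation with past control decisions.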

A subtlety now arises: Even though the $\{\omega_1, \ldots, \omega_W\}$ sequence is i.i.d., these samples are not independent of the queue backlog $\Theta(t_g)$ at the beginning of the renewal. This is because these values have influenced the queue states. This makes it challenging to directly implement a Robbins-Monro iteration. Indeed, the expectation in (35) can be viewed as a conditional expectation given a certain queue backlog at the beginning of the renewal interval, which is $\Theta(t_g)$ for the $g$th renewal. This conditioning does not affect (35) when $\omega(t)$ is chosen independently of initial queue backlog, and so the random samples in the Robbins-Monro iteration (39) are also assumed to be chosen independent of the initial queue backlog, which is not the case if we sample from the past.

To avoid this difficulty and ensure the samples are both i.i.d. and independent of the queue states that form the weights in our stochastic shortest path problem, we use a delayed queue analysis. Let $t_{start}$ denote the slot on which sample $\omega_W$ is taken, and let $\Theta(t_{start})$ represent the queue backlogs at that time. It follows that the i.i.d. samples are also independent of $\Theta(t_{start})$. Hence, the bounds derived for the iteration technique in the previous section can be applied when the iterates use $\Theta(t_{start})$ as the backlog vector. Let $J_{\Theta(t_g)}$ denote the optimal solution to the problem (33) for a queue backlog $\Theta(t_g)$ at the beginning of our renewal time $t_g$, and let $J_{\Theta(t_{start})}$ denote the corresponding optimal solution for a problem that starts with initial queue backlog $\Theta(t_{start})$. Let $F$ denote the number of slots between $t_{start}$ and $t_g$ (so that $t_{start} + F = t_g$). For type-2 renewals, $F = W - 1$. For type-1 renewals, $F = H + W - 1$, where $H$ is a geometric random variable with mean $1/\phi$. Because there are only $F$ slots between time $t_{start}$ and $t_g$, and the maximum change in any queue on one slot is bounded, we want to claim that the expected difference between these vectors is bounded. This is justified by the next lemma, which bounds the deviation of the optimal costs associated with two general queue backlog vectors.

Let $\Theta_1$ and $\Theta_2$ be two different queue backlog vectors, and let $J_{\Theta_1}$ and $J_{\Theta_2}$ represent the optimal frame costs corresponding to $\Theta_1$ and $\Theta_2$, respectively. Define the constant $\beta$ as follows:

$$\beta \triangleq \sup_{I,\Omega} \|c_{\Theta_1}(I,\Omega) - c_{\Theta_2}(I,\Omega)\|_d \tag{43}$$

where $c_\Theta(I, \Omega)$ is the vector, indexed by $z$, with the $z$th entry given by (31) using backlog vector $\Theta$. Note from (31) that $\beta$ is independent of $V$ (as the $V$ term in (31) cancels out in the subtraction), and is proportional to the maximum penalty value times the maximum difference in any queue backlog entry in $\Theta_1$ and its corresponding entry in $\Theta_2$. Thus $\beta$ is also independent of the actual size of the backlog vectors, and depends only on their difference.

Lemma 5: For the vectors $\Theta_1$ and $\Theta_2$, and for the $\beta$ value defined in (43), we have:

(a) The difference between $J_{\Theta_1}$ and $J_{\Theta_2}$ satisfies:

$$\|J_{\Theta_1} - J_{\Theta_2}\|_d \leq \frac{\beta}{\phi}$$

(b) Let $I_1(t)$ denote the policy decisions at time $t$ under the policy that makes optimal decisions subject to queue backlogs $\Theta_1$, and define $J_{\Theta_2}^{mis}$ as the expected sum cost over a frame of a mismatched policy that incurs costs according to backlog vector $\Theta_2$ but makes decisions according to $I_1(t)$ (and hence has the same frame duration and decisions as the optimal policy for $\Theta_1$). Then:

$$J_{\Theta_2} \leq J_{\Theta_2}^{mis} \leq J_{\Theta_1} + \mathbf{1}\frac{\beta}{\phi}$$

where $\mathbf{1}$ is a vector of all 1 values with the same dimension as $J_{\Theta_1}$.

Proof: See Appendix G. □
The above lemma shows that if we use the Robbins-Monro iterations on $\Theta(t_{start})$, then we achieve a vector that is close to $J_{\Theta(t_{start})}$ (according to the bounds given in the previous section), which in turn differs from $J_{\Theta(t_g)}$ by at most a constant, where the constant does not depend on the size of $V$ and depends only on the maximum difference between queue backlogs $\Theta(t_{start})$ and $\Theta(t_g)$. Similar reasoning shows that the implementation of (32) can use any queue backlogs that are close to the queue backlogs at the start of the frame, including using queue backlogs $\Theta(t)$ that are updated on each slot in the frame. While such an implementation leads to a larger theoretical bound, in practice it may improve performance by allowing a faster reaction to emerging queue backlogs $Q_n(t)$.

V. Optimizing Convex Functions of Time Averages 

Here we describe how optimization of convex functions of time averages can be achieved using the same framework of the previous sections. Specifically, we extend our method of auxiliary variables and flow state queues, developed for stochastic network optimization in [21] [1], to this Markov modulated network context. Consider the same network model as described in Section II. Define $x(t)$ as a vector of penalties for $m \in \{1, \ldots, M\}$:

$$x(t) = (x_1(t), \ldots, x_M(t))$$

where $x_m(t) = x_m(I(t), \Omega(t), z(t))$. For a given policy $I(t)$, define the following $t$-slot time average:

$$\overline{x}(t) \triangleq \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{x(\tau)\}$$

Rather than considering the stochastic network optimization problem (2)-(5), we consider a more general objective. Let $f(x), h_1(x), h_2(x), \ldots, h_L(x)$ be a collection of continuous and convex functions of $x \in \mathbb{R}^M$ (for some positive integer $L$). Define $\mathcal{L} \triangleq \{1, \ldots, L\}$. The generalized problem is:

Minimize: $\limsup_{t\rightarrow\infty} f(\overline{x}(t))$

Subject to: $\limsup_{t\rightarrow\infty} h_l(\overline{x}(t)) \leq c_l$ for all $l \in \mathcal{L}$
$\overline{Q}_n < \infty$ for all $n \in \mathcal{N}$

where $c_l$ are arbitrary constants.



This general objective is similar to the objectives of [1] [20] [21] which treat networks without the Markov modulated $z(t)$ variable. For simplicity of exposition, let us assume in this paragraph that all limits are well defined, and use $\overline{x}$ to represent $\lim_{t\rightarrow\infty} \overline{x}(t)$. With this notation, we can re-write the problem as:

Minimize: $f(\overline{x})$ (44)

Subject to: $h_l(\overline{x}) \leq c_l$ for all $l \in \mathcal{L}$ (45)

$\overline{Q}_n < \infty$ for all $n \in \mathcal{N}$ (46)



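This problem can be transformed as follows: Define $\gamma(t) = (\gamma_1(t), \ldots, \gamma_M(t))$ as a vector of auxiliary variables (one auxiliary variable $\gamma_m(t)$ for each penalty $m \in \mathcal{M}$). On each slot $t$, $\gamma(t)$ can be chosen as any vector that satisfies:

$$x_m^{min} - a \leq \gamma_m(t) \leq x_m^{max} + a \quad \forall m \in \mathcal{M} \tag{47}$$

for some fixed value $a \geq 0$ (where choosing $a > 0$ is sometimes useful for allowing slackness conditions). It is easy to see that the above problem is equivalent to the following:

Minimize: $\overline{f(\gamma)}$ (48)

Subject to: $\overline{h_l(\gamma)} \leq c_l \quad \forall l \in \mathcal{L}$ (49)

$\overline{\gamma}_m = \overline{x}_m \quad \forall m \in \mathcal{M}$ (50)

$\overline{Q}_n < \infty \quad \forall n \in \mathcal{N}$ (51)

where the time averages are defined:

$$\overline{f(\gamma)} \triangleq \lim_{t\rightarrow\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{f(\gamma(\tau))\}, \qquad \overline{h_l(\gamma)} \triangleq \lim_{t\rightarrow\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{h_l(\gamma(\tau))\}$$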

This equivalence can be briefly explained as follows: Let $x(t)$ 
be the penalty vector used over time $t$ under the optimal policy 
for the problem (44)-(46), and let $\overline{x}$ be the time average 
(assumed to exist for simplicity of the discussion in this 
paragraph). Then we can use the same policy, together with 
the auxiliary variable decisions $\gamma(t) = \overline{x}$ for all $t$, to achieve 
the same penalty value and achieve the desired constraints in 
the new problem (48)-(51). Therefore, the minimum value in 
the new problem is less than or equal to that of the problem 
(44)-(46). On the other hand, any policy that optimizes the 
new problem can be shown, by Jensen's inequality for convex 
functions, to also satisfy the constraints of the original problem 
(44)-(46), and to have a minimum value that is greater than 
or equal to the optimal value of the problem (44)-(46). 
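
The Jensen step used in this equivalence can be checked numerically. The sketch below is only an illustration: the scalar cost `f(gamma) = gamma**2` and the uniformly drawn auxiliary decisions are assumptions, not part of the paper. It compares the time average of $f(\gamma(t))$ with $f$ evaluated at the time average of $\gamma(t)$.

```python
import random


def time_average(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def f(gamma):
    """Illustrative convex cost (an assumption): f(gamma) = gamma**2."""
    return gamma * gamma


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical auxiliary decisions gamma(0), ..., gamma(999).
gammas = [rng.uniform(-1.0, 3.0) for _ in range(1000)]

avg_of_f = time_average([f(g) for g in gammas])  # time average of f(gamma(t))
f_of_avg = f(time_average(gammas))               # f of the time-averaged gamma

# Jensen's inequality for convex f says avg_of_f >= f_of_avg.
jensen_gap = avg_of_f - f_of_avg
```

The gap is zero when $\gamma(t)$ is held constant, which is why choosing $\gamma(t) = \overline{x}$ for all $t$ loses nothing.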

The new problem (48)-(51) is of the form described in 
the previous sections of this paper, with the exception of 
the linear equality constraint (50). There are several ways of 
treating this equality constraint. In the case when the functions 
$f(\gamma)$ and $h_l(\gamma)$ are non-increasing in each entry $\gamma_m$, we can 
replace the constraint with $\overline{\gamma}_m \le \overline{x}_m$ for each $m \in \mathcal{M}$ 
(as in [1]). If the non-increasing property does not hold, 
we can approximate each linear equality constraint with two 
linear inequality constraints $\overline{\gamma}_m \le \overline{x}_m + \epsilon$ and 
$\overline{x}_m \le \overline{\gamma}_m + \epsilon$ for some small value $\epsilon > 0$ that allows the 
required slackness assumptions (Assumption 1) of Section III-E 
to hold. A more elegant solution, which does not require an 
approximation, uses a generalized virtual queue (which can 
possibly take negative values) of the form: 

$$W_m(t+1) = W_m(t) - \gamma_m(t) + x_m(t) \tag{52}$$

Stabilizing this virtual queue $W_m(t)$ can be done via a 
Lyapunov function in a manner similar to stabilizing the other 
queues $Q_n(t)$ and $Y_m(t)$, and ensures that the linear equality 
constraint is satisfied. This approach is used in [27] in the 
case without the Markov modulated $z(t)$ variable. In the next 
section we combine this approach with our framework of 
Markov modulated networks with variable length frames. 

A. The Generalized Weighted Shortest Path Algorithm 

We have queues $Q_n(t)$ for $n \in \mathcal{N}$ with dynamics given 
by (1). For each $l \in \mathcal{L}$, define queues $Y_l(t)$ to enforce the 
constraints of (49): 

$$Y_l(t+1) = \max[Y_l(t) - c_l + h_l(\gamma(t)), 0] \tag{53}$$
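
The two virtual queue updates (52) and (53) can be sketched in a few lines. This is a minimal scalar illustration under assumed inputs; the function names `update_w`, `update_y`, and `run` are hypothetical and the sample sequences are made up for the example.

```python
def update_w(w, gamma, x):
    """Signed virtual queue (52): W(t+1) = W(t) - gamma(t) + x(t)."""
    return w - gamma + x


def update_y(y, c, h_value):
    """Non-negative virtual queue (53): Y(t+1) = max[Y(t) - c + h(gamma(t)), 0]."""
    return max(y - c + h_value, 0.0)


def run(gammas, xs, h, c):
    """Run both queues from zero initial backlog over the given sample path."""
    w, y = 0.0, 0.0
    for gamma, x in zip(gammas, xs):
        w = update_w(w, gamma, x)
        y = update_y(y, c, h(gamma))
    return w, y


# Hypothetical sample path: three slots, h(gamma) = gamma, constraint level c = 2.5.
w_final, y_final = run([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], lambda g: g, 2.5)
```

Starting from $W_m(0) = 0$, the value $W_m(t)$ equals the running sum of $x_m(\tau) - \gamma_m(\tau)$, so $|W_m(t)|/t \to 0$ forces the two time averages to agree. Likewise $Y_l(t)$ is at least the running sum of $h_l(\gamma(\tau)) - c_l$.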



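We use the queues $W_m(t)$ in (52) to enforce the constraints 
(50). Define $\Theta(t) \triangleq [Y(t); Q(t); W(t)]$ as the combined queue 
vector, and define the following Lyapunov function: 

$$L_2(\Theta(t)) \triangleq \frac{1}{2}\sum_{n\in\mathcal{N}} Q_n(t)^2 + \frac{1}{2}\sum_{l\in\mathcal{L}} Y_l(t)^2 + \frac{1}{2}\sum_{m\in\mathcal{M}} W_m(t)^2$$

Consider any definition of a renewal. Let $t_g$ be the start of a 
renewal interval, and let $T$ be its duration. Define: 

$$\Delta_T(\Theta(t_g)) \triangleq \mathbb{E}\{L_2(\Theta(t_g + T)) - L_2(\Theta(t_g)) \mid \Theta(t_g)\}$$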



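Lemma 6: If $t_g$ is a renewal time and $T$ is the duration until 
the next renewal, we have: 

$$\Delta_T(\Theta(t_g)) \le B_2 + D_2(\Theta(t_g)) \tag{54}$$

where $B_2$ is a positive constant that satisfies for all $t$, $\Theta(t_g)$: 

$$B_2 \ge \frac{\mathbb{E}\{T^2 \mid \Theta(t_g)\}}{2}\Big[\sum_{n\in\mathcal{N}} [\mu_n(t)^2 + R_n(t)^2] + \sum_{m\in\mathcal{M}} (\gamma_m(t) - x_m(t))^2 + \sum_{l\in\mathcal{L}} (h_l(\gamma(t)) - c_l)^2\Big]$$

and $D_2(\Theta(t_g))$ is defined: 

$$D_2(\Theta(t_g)) \triangleq -\sum_{n\in\mathcal{N}} Q_n(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} d_n(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} - \sum_{m\in\mathcal{M}} W_m(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} [\gamma_m(t_g+\tau) - x_m(t_g+\tau)] \,\Big|\, \Theta(t_g)\Big\} - \sum_{l\in\mathcal{L}} Y_l(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} [c_l - h_l(\gamma(t_g+\tau))] \,\Big|\, \Theta(t_g)\Big\} \tag{55}$$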

As before, our goal is to control the system over frames 
defined by renewal intervals starting at time $t_g$ (and having 
duration $T$) by choosing $I(t)$ and $\gamma_m(t)$ subject to (47) to 
minimize the following drift-plus-penalty expression: 

$$D_2(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \tag{56}$$



The Generalized Weighted Shortest Path Algorithm: 

Fix control parameters $V > 0$ and $\sigma \ge 0$. At time $t_g$, which 
is the beginning of the $g$th renewal interval (where $t_0 = 0$), 
observe queue values $\Theta(t_g)$ and perform the following: 

1) Compute $\gamma_g = (\gamma_{1,g}, \ldots, \gamma_{M,g})$ that solves: 

Minimize: $V f(\gamma) + \sum_{l\in\mathcal{L}} Y_l(t_g) h_l(\gamma) - \sum_{m\in\mathcal{M}} W_m(t_g)\gamma_m$ 

Subject to: $x_m^{min} - \sigma \le \gamma_m \le x_m^{max} + \sigma \quad \forall m \in \mathcal{M}$ 

2) Choose $\gamma(t_g + \tau) = \gamma_g$ for all $\tau \in \{0, \ldots, T_g - 1\}$. That 
is, use the fixed value $\gamma_g$ for the full duration of the $g$th 
renewal interval. 

3) Choose control actions $I(t_g + \tau) \in \mathcal{I}_{\omega(t_g+\tau), z(t_g+\tau)}$ over 
the renewal interval according to the optimal actions that 
solve the weighted stochastic shortest path problem (56) 
with $\gamma(t_g + \tau) = \gamma_g$. For each timestep $t_g + \tau$, update 
the actual queues $Q_n(t_g + \tau)$ according to (1) and update 
the virtual queues $Y_l(t_g + \tau)$ and $W_m(t_g + \tau)$ according 
to (53) and (52). 

Note that the optimal solution to the weighted stochastic 
shortest path problem (56) has constant auxiliary variables 
$\gamma(t_g + \tau) = \gamma_g$ over the course of the renewal interval, where 
$\gamma_g$ is computed in step 1. If the functions $f(\gamma)$ and $h_l(\gamma)$ are 
separable in each entry $\gamma_m$, the computation of step 1 reduces 
to finding, for each $m \in \mathcal{M}$, the minimum of a single-variable 
convex function over a closed interval. 
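
For the separable case, step 1 can be sketched as one scalar convex minimization per entry. The code below is a hedged illustration, not the paper's implementation: it assumes $f(\gamma) = \sum_m f_m(\gamma_m)$, assumes there are no $h_l$ constraint terms (or that they are already folded into each `f_m`), and uses ternary search as a stand-in for any one-dimensional convex solver. The names `ternary_min` and `choose_gamma` are hypothetical.

```python
def ternary_min(g, lo, hi, iters=200):
    """Minimize a convex single-variable function g over the closed interval [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


def choose_gamma(V, W, f_terms, x_min, x_max, sigma=0.0):
    """Per entry m, minimize V*f_m(g) - W[m]*g over [x_min[m]-sigma, x_max[m]+sigma].

    Assumption: f is separable and constraint terms are omitted for brevity.
    """
    gammas = []
    for m, f_m in enumerate(f_terms):
        def objective(g, m=m, f_m=f_m):
            return V * f_m(g) - W[m] * g
        gammas.append(ternary_min(objective, x_min[m] - sigma, x_max[m] + sigma))
    return gammas


# Hypothetical example: f_m(g) = g**2, V = 1, interval [-1, 3] for both entries.
gamma_g = choose_gamma(1.0, [2.0, 10.0], [lambda g: g * g] * 2, [-1.0, -1.0], [3.0, 3.0])
```

With $W_m = 2$ the unconstrained minimizer $\gamma = 1$ lies inside the interval. With $W_m = 10$ it lies outside, so the solution is clipped to the upper endpoint $x_m^{max} + \sigma$.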

B. Structure of the $f(\cdot)$ and $h_l(\cdot)$ functions 

We assume the functions $f(\gamma)$ and $h_l(\gamma)$ (for $l \in \mathcal{L}$) are 
convex over the set of all $\gamma$ vectors that satisfy (47). We further 
require the following mild assumptions. 

• There are finite bounds $f_{min}$ and $f_{max}$ such that: 

$$f_{min} \le f(\gamma) \le f_{max} \quad \text{whenever } \gamma \text{ satisfies (47)}$$

• Functions $f(\gamma)$, $h_l(\gamma)$ are Lipschitz continuous. In particular: 

$$|h_l(\gamma_1) - h_l(\gamma_2)|, \; |f(\gamma_1) - f(\gamma_2)| \le \beta\|\gamma_1 - \gamma_2\|$$

whenever $\gamma_1, \gamma_2$ satisfy (47), where $\beta$ is some finite 
constant that is independent of $\gamma_1, \gamma_2$. 

In the special case when $f(\gamma)$ and $h_l(\gamma)$ are differentiable, 
the constant $\beta$ is determined by the maximum gradient 
norm over the closed set defined by (47). 

C. Analysis of the Generalized Algorithm 

We assume optimality of the problem (44)-(46) can be 
achieved by a $(z, \Omega)$-only policy that has optimal time average 
penalty vector $\overline{x}$. Define $\gamma^{opt}$ as this optimal time average 
penalty vector, and define $f^{opt} = f(\gamma^{opt})$. Note that optimality 
is still measured with respect to assumed $\phi$-forced renewals. 

Assumption 3: There is a $(z, \Omega)$-only policy $I^*(t)$ such that 
for every renewal time $t_g$ we have: 

$$\frac{\mathbb{E}\{\sum_{\tau=0}^{T^*-1} x_m^*(t_g+\tau)\}}{\mathbb{E}\{T^*\}} = \gamma_m^{opt} \quad \text{for all } m \in \mathcal{M} \tag{57}$$
$$\frac{\mathbb{E}\{\sum_{\tau=0}^{T^*-1} d_n^*(t_g+\tau)\}}{\mathbb{E}\{T^*\}} \ge 0 \quad \text{for all } n \in \mathcal{N} \tag{58}$$

where $T^*$ is the renewal interval duration under policy $I^*(t)$, 
$x_m^*(t)$ and $d_n^*(t)$ correspond to policy $I^*(t)$, and where $\gamma^{opt} = 
(\gamma_1^{opt}, \ldots, \gamma_M^{opt})$ satisfies $f(\gamma^{opt}) = f^{opt}$ and: 

$$h_l(\gamma^{opt}) \le c_l \quad \forall l \in \mathcal{L}$$

We can also impose an additional slackness assumption 
(similar to Assumption 1) to allow for possible approximate 
implementations of the weighted stochastic shortest path policy, 
and to prove strong stability of all queues. However, for 
simplicity, below we state the performance theorem under the 
assumption that the exact stochastic shortest path solution to 
(56) is used every renewal interval. Further, rather than proving 
strong stability for the virtual queues, we prove a weaker mean 
rate stability result. 

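Theorem 3: (Performance of the Generalized Weighted 
Shortest Path Algorithm) Consider any valid definition of a 
renewal event. Suppose that Assumption 3 holds. For any fixed 
$V > 0$, $\sigma \ge 0$, if the optimal solution to (56) is implemented 
every renewal interval, then: 

(a) For all $n \in \mathcal{N}$, $l \in \mathcal{L}$, $m \in \mathcal{M}$ we have: 

$$\lim_{t\to\infty} \frac{\mathbb{E}\{Q_n(t)\}}{t} = \lim_{t\to\infty} \frac{\mathbb{E}\{Y_l(t)\}}{t} = \lim_{t\to\infty} \frac{\mathbb{E}\{|W_m(t)|\}}{t} = 0$$

(b) For all $l \in \mathcal{L}$ we have: 

$$\limsup_{t\to\infty} h_l(\overline{x}(t)) \le c_l$$

where $\overline{x}(t) \triangleq \frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x(\tau)\}$. 

(c) If renewals are type-2 (so that $\mathbb{E}\{T \mid \Theta(t_g)\} = 1/\phi$), 
then the time average cost satisfies: 

$$\limsup_{t\to\infty} f(\overline{x}(t)) \le f^{opt} + \frac{B_2\phi}{V}$$

(d) If renewals are type-2 and Assumption 4 additionally 
holds for a slackness value $\epsilon > 0$ (where Assumption 4 is 
formally stated below), then all queues $Q_n(t)$ for $n \in \mathcal{N}$ are 
strongly stable, and for any positive integer $G$: 

$$\frac{1}{G}\sum_{g=0}^{G-1}\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\} \le \frac{\phi B_2 + V(f_{max} - f_{min})}{\epsilon} + \frac{\phi\mathbb{E}\{L(\Theta(0))\}}{\epsilon G}$$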

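The Assumption 4 used in part (d) is given below. 

Assumption 4: There is a value $\epsilon > 0$ and a vector $\gamma^* = 
(\gamma_1^*, \ldots, \gamma_M^*)$ together with a $(z, \Omega)$-only policy $I^*(t)$ (not 
necessarily the same policy as in Assumption 3) such that for 
every renewal time $t_g$: 

$$\frac{\mathbb{E}\{\sum_{\tau=0}^{T^*-1} x_m^*(t_g+\tau)\}}{\mathbb{E}\{T^*\}} = \gamma_m^* \quad \text{for all } m \in \mathcal{M}$$
$$\frac{\mathbb{E}\{\sum_{\tau=0}^{T^*-1} d_n^*(t_g+\tau)\}}{\mathbb{E}\{T^*\}} \ge \epsilon \quad \text{for all } n \in \mathcal{N}$$

where $T^*$ is the renewal interval duration under policy $I^*(t)$, 
$x_m^*(t)$ and $d_n^*(t)$ correspond to policy $I^*(t)$, and where 

$$h_l(\gamma^*) \le c_l \quad \forall l \in \mathcal{L}$$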



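Proof: Using (54) yields for any renewal time $t_g$: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \le B_2 + D_2(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\}$$

where we assume we are using the stochastic shortest path 
policy that minimizes (56), and $T$ is the resulting renewal 
time under this policy. By definition we thus have: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \le B_2 + D_2^*(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T^*-1} f(\gamma^*(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \tag{59}$$

where $D_2^*(\Theta(t_g))$ represents (55) for any other policy $I^*(t)$ 
and $\gamma^*(t)$, and $T^*$ represents the corresponding renewal duration 
under this other policy. Now let $I^*(t)$ be the stationary and 
randomized policy of Assumption 3, and let $\gamma^*(t_g + \tau) = \gamma^{opt}$ 
for all $t_g + \tau$ in the renewal interval. We thus have (using the 
definition of $D_2^*(\Theta(t_g))$ in (55)): 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \le B_2 + V\mathbb{E}\{T^*\}f^{opt} - \sum_{n\in\mathcal{N}} Q_n(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T^*-1} d_n^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} - \sum_{m\in\mathcal{M}} W_m(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T^*-1} [\gamma_m^{opt} - x_m^*(t_g+\tau)] \,\Big|\, \Theta(t_g)\Big\}$$

where we have used the fact that $h_l(\gamma^{opt}) \le c_l$ for all $l \in \mathcal{L}$. 
Plugging in the expressions (57) and (58) yields: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \le B_2 + V\mathbb{E}\{T^*\}f^{opt} \tag{60}$$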



(Proof of part (a)): Because first moments of renewal 
intervals are bounded by $m_1$ (by the renewal requirements 
in Section III-E), and because $f(\gamma)$ is bounded, the inequality 
(60) yields the following for all renewal times $t_g$: 

$$\Delta_T(\Theta(t_g)) \le B_3$$

where $B_3$ is a finite positive constant. Taking expectations of 
the above inequality and summing over $g \in \{0, 1, \ldots, G-1\}$ 
(where $G$ is any positive integer) yields: 

$$\mathbb{E}\{L(\Theta(t_G))\} \le GB_3 + \mathbb{E}\{L(\Theta(0))\}$$

Because the Lyapunov function $L(\Theta(t_G))$ is a sum of squares 
of queue backlog (each weighted by 1/2), we have for any particular queue $Q_n(t_G)$ 
(for $n \in \mathcal{N}$): 

$$\mathbb{E}\{Q_n(t_G)\}^2 \le \mathbb{E}\{Q_n(t_G)^2\} \le 2GB_3 + 2\mathbb{E}\{L(\Theta(0))\}$$

Taking square roots and dividing by $t_G$ yields: 

$$\frac{\mathbb{E}\{Q_n(t_G)\}}{t_G} \le \frac{\sqrt{2GB_3 + 2\mathbb{E}\{L(\Theta(0))\}}}{t_G}$$

Taking a limit as $G \to \infty$ and noting that $t_G \ge G$ shows that: 

$$\lim_{G\to\infty} \frac{\mathbb{E}\{Q_n(t_G)\}}{t_G} = 0$$

While the above limit only samples $Q_n(t)$ at renewal times, 
it is not difficult to use the fact that mean renewal times are 
finite to show: 

$$\lim_{t\to\infty} \frac{\mathbb{E}\{Q_n(t)\}}{t} = 0$$

This derivation holds for all other queues $Y_l(t)$ and $W_m(t)$ in 
the Lyapunov function, proving part (a). 

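(Proof of part (b)): Because queues $Y_l(t)$ and $W_m(t)$ are 
mean rate stable and have dynamics given by (53) and (52), 
the constraints they enforce hold [18]. In particular, we have: 

$$\limsup_{t\to\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{h_l(\gamma(\tau))\} \le c_l \tag{61}$$
$$\lim_{t\to\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{x_m(\tau) - \gamma_m(\tau)\} = 0 \tag{62}$$

Now define: 

$$\overline{\gamma}_m(t) \triangleq \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{\gamma_m(\tau)\}$$

Define $a(t) \triangleq \overline{\gamma}(t) - \overline{x}(t)$ and note that (62) implies that 
$a(t) \to 0$ as $t \to \infty$. By Jensen's inequality for the convex 
function $h_l(\cdot)$, we have: 

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{h_l(\gamma(\tau))\} \ge h_l(\overline{\gamma}(t)) \ge h_l(\overline{x}(t)) - \beta\|a(t)\|$$

where we have used the Lipschitz continuity property of $h_l(\cdot)$. 
Taking a limit as $t \to \infty$ and using (61) proves part (b). 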

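(Proof of part (c)): Recall that (60) holds for all renewal 
times $t_g$ for $g \in \{0, 1, 2, \ldots\}$. Taking expectations and 
summing over $g \in \{0, 1, \ldots, G-1\}$ (for any positive integer 
$G$) yields: 

$$\mathbb{E}\{L(\Theta(t_G))\} - \mathbb{E}\{L(\Theta(0))\} + V\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1} f(\gamma(\tau))\Big\} \le GB_2 + VG\mathbb{E}\{T^*\}f^{opt}$$

Using non-negativity of the Lyapunov function, rearranging 
terms, and using $\mathbb{E}\{T^*\} = 1/\phi$ (because we have type-2 
renewals) yields: 

$$\frac{\phi}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1} f(\gamma(\tau))\Big\} \le f^{opt} + \frac{B_2\phi}{V} + \frac{\phi\mathbb{E}\{L(\Theta(0))\}}{VG}$$

Taking a limit as $G \to \infty$ yields: 

$$\limsup_{G\to\infty} \frac{\phi}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1} f(\gamma(\tau))\Big\} \le f^{opt} + \frac{B_2\phi}{V}$$

However, as in the proof of (82) in Appendix D, the above 
limit implies: 

$$\limsup_{t\to\infty} \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{f(\gamma(\tau))\} \le f^{opt} + \frac{B_2\phi}{V} \tag{63}$$

By using Jensen's inequality on the convex function $f(\gamma(\tau))$ 
we have for any time $t$: 

$$\frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{E}\{f(\gamma(\tau))\} \ge f(\overline{\gamma}(t)) \ge f(\overline{x}(t)) - \beta\|a(t)\|$$

Taking a lim sup and combining with (63) completes the proof 
of part (c). 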

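(Proof of part (d)): Using the policy $I^*(t)$ from Assumption 
4, and using $\gamma(t_g + \tau) = \gamma^*$ from Assumption 4 for all $\tau$, 
from (59) we have: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1} f(\gamma(t_g+\tau)) \,\Big|\, \Theta(t_g)\Big\} \le B_2 + \frac{Vf_{max}}{\phi} - \sum_{n\in\mathcal{N}} Q_n(t_g)\frac{\epsilon}{\phi}$$

Because all renewals are geometric with probability $\phi$, the 
second term on the left hand side is at least $Vf_{min}/\phi$ and hence: 

$$\Delta_T(\Theta(t_g)) \le B_2 + V(f_{max} - f_{min})/\phi - \sum_{n\in\mathcal{N}} Q_n(t_g)\epsilon/\phi$$

The above holds for all renewal times $t_g$ for $g \in \{0, 1, 2, \ldots\}$. 
Taking expectations, summing, and dividing by $G$ yields: 

$$\frac{\mathbb{E}\{L(\Theta(t_G))\} - \mathbb{E}\{L(\Theta(0))\}}{G} \le B_2 + V(f_{max} - f_{min})/\phi - \frac{\epsilon}{\phi}\cdot\frac{1}{G}\sum_{g=0}^{G-1}\sum_{n\in\mathcal{N}} \mathbb{E}\{Q_n(t_g)\}$$

Rearranging terms and using non-negativity of $L(\Theta(t_G))$ 
proves part (d). □ 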

VI. Conclusions 

We have developed an approach to the constrained Markov 
Decision problems associated with a small number K of 
delay-constrained wireless users and a (possibly large) number 
N of delay-unconstrained queues. Optimization of general 
penalty functions subject to general penalty constraints and 
queue stability constraints is treated by reduction to an online 
(unconstrained) weighted stochastic shortest path problem 
implemented over variable length frames. This generalizes 
the class of max-weight network control policies to Markov- 
modulated networks. The solution to the underlying stochastic 
shortest path problem has complexity that is geometric in the 
number of delay-constrained queues K, but polynomial in the 
number of delay-unconstrained queues N. Explicit bounds 
on the average backlog in the delay-unconstrained queues 
were computed and shown to be polynomial in N + K. The 
average size of the virtual queues (which enforce the penalty 
constraints) was also bounded, which shows that convergence 
times required to achieve the desired time averages are also 
polynomial in N + K. A Robbins-Monro approximation technique, 
together with a delayed queue analysis, was shown to 
provide an efficient online implementation in the case when the 
system probabilities are not known in advance. The solution 
technique is general and extends to other Markov modulated 
networks with general penalties and rewards. 

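Appendix A — Proof of Lemma 1 

We begin with a preliminary lemma (proven at the end of 
this section). 

Lemma 7: For any time $t$ and any integer $T > 0$ we have: 

$$\sum_{n\in\mathcal{N}}\Big[\Big(\sum_{\tau=0}^{T-1}\mu_n(t+\tau)\Big)^2 + \Big(\sum_{\tau=0}^{T-1}R_n(t+\tau)\Big)^2\Big] + \sum_{m\in\mathcal{M}}\Big(\sum_{\tau=0}^{T-1}|\alpha_m(t+\tau)|\Big)^2 \le T^2\sigma^2$$

where $\sigma^2$ is defined in (19). 

We now use Lemma 7 to prove Lemma 1. From the queue 
update equation (1) we have for any queue $n \in \mathcal{N}$ and any 
renewal time $t_g$ (see, for example, [1] [28]): 

$$Q_n(t_g + T) \le \max\Big[Q_n(t_g) - \sum_{\tau=0}^{T-1}\mu_n(t_g+\tau), 0\Big] + \sum_{\tau=0}^{T-1}R_n(t_g+\tau) \tag{64}$$

Squaring the above equation and using the fact that: 

$$\frac{1}{2}(\max[q - \mu, 0] + r)^2 - \frac{q^2}{2} \le \frac{\mu^2 + r^2}{2} - q(\mu - r) \tag{65}$$

for any non-negative values $q, \mu, r$ yields: 

$$\frac{1}{2}Q_n(t_g+T)^2 - \frac{1}{2}Q_n(t_g)^2 \le \frac{\big(\sum_{\tau=0}^{T-1}\mu_n(t_g+\tau)\big)^2 + \big(\sum_{\tau=0}^{T-1}R_n(t_g+\tau)\big)^2}{2} - Q_n(t_g)\sum_{\tau=0}^{T-1}[\mu_n(t_g+\tau) - R_n(t_g+\tau)] \tag{66}$$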
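
The elementary inequality (65) is easy to spot-check. The sketch below evaluates both sides on a small grid of non-negative values; the grid and the function names are illustrative choices, not part of the paper.

```python
def lhs(q, mu, r):
    """Left side of (65): (1/2)(max[q - mu, 0] + r)^2 - q^2/2."""
    return 0.5 * (max(q - mu, 0.0) + r) ** 2 - 0.5 * q ** 2


def rhs(q, mu, r):
    """Right side of (65): (mu^2 + r^2)/2 - q(mu - r)."""
    return 0.5 * (mu ** 2 + r ** 2) - q * (mu - r)


def max_violation(grid):
    """Largest value of lhs - rhs over all (q, mu, r) triples drawn from grid."""
    worst = float("-inf")
    for q in grid:
        for mu in grid:
            for r in grid:
                worst = max(worst, lhs(q, mu, r) - rhs(q, mu, r))
    return worst


worst_case = max_violation([0.0, 0.5, 1.0, 2.0, 5.0])
```

When $q \ge \mu$ the two sides differ by exactly $\mu r$, and when $q < \mu$ they differ by $\frac{1}{2}(\mu - q)^2 + qr$, so the inequality holds with equality only in degenerate cases such as $r = 0$ with $q \ge \mu$.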

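For the $Y_m(t)$ queue, from (15) we have: 

$$Y_m(t+1) = \max[Y_m(t) - \alpha_m(t), 0]$$

where we define $\alpha_m(t) \triangleq x_m^{av} - x_m(t)$. Similar to (66), we have 
the following lemma (proven at the end of this section). 

Lemma 8: 

$$\frac{1}{2}Y_m(t_g+T)^2 \le \frac{1}{2}Y_m(t_g)^2 - Y_m(t_g)\sum_{\tau=0}^{T-1}\alpha_m(t_g+\tau) + \frac{1}{2}\Big(\sum_{\tau=0}^{T-1}|\alpha_m(t_g+\tau)|\Big)^2 \tag{67}$$

Combining (66) and (67), using Lemma 7 and taking 
conditional expectations yields: 

$$\Delta_T(\Theta(t_g)) \le B - \sum_{m\in\mathcal{M}} Y_m(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}\alpha_m(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} - \sum_{n\in\mathcal{N}} Q_n(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}[\mu_n(t_g+\tau) - R_n(t_g+\tau)] \,\Big|\, \Theta(t_g)\Big\}$$

This proves Lemma 1. 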

It remains only to prove Lemmas 7 and 8. 

Proof: (Lemma 7) Define the following vector $\beta(\tau)$: 

$$\beta(\tau) \triangleq [(\mu_n(\tau))_{n\in\mathcal{N}}, (R_n(\tau))_{n\in\mathcal{N}}, (|\alpha_m(\tau)|)_{m\in\mathcal{M}}]$$

Denote by $A$ the set of all possible values that can be achieved 
by the vector $\beta(\tau)$ for a single timeslot $\tau$ (considering all 
possible control actions and all possible random outcomes 
$\Omega(\tau)$). Note that for any time $t$ and any integer $T > 0$: 

$$\frac{1}{T}\sum_{\tau=0}^{T-1}\beta(t+\tau) \in Conv(A) \tag{68}$$

where $Conv(A)$ is the convex hull of the set $A$. Now define 
the following convex function: 

$$f(\beta(\tau)) \triangleq \sum_{n\in\mathcal{N}}[\mu_n(\tau)^2 + R_n(\tau)^2] + \sum_{m\in\mathcal{M}}|\alpha_m(\tau)|^2$$

Note also that $f(T\beta(t)) = T^2f(\beta(t))$. Thus: 

$$f\Big(\sum_{\tau=0}^{T-1}\beta(t+\tau)\Big) = T^2 f\Big(\frac{1}{T}\sum_{\tau=0}^{T-1}\beta(t+\tau)\Big) \le T^2\sup_{\beta\in Conv(A)} f(\beta) \tag{69}$$
$$= T^2\sup_{\beta\in A} f(\beta) \tag{70}$$
$$\le T^2\sigma^2 \tag{71}$$

where (69) follows by (68), (70) follows because the supremum 
of a convex function over the convex hull of a set $A$ 
is the same as the supremum over the set $A$ itself, and (71) 
follows by definition of $\sigma^2$ in (19). □ 
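
The key step in this proof, that a convex function evaluated at a sum of $T$ vectors from $A$ is at most $T^2$ times its supremum over $A$, can be illustrated numerically. The finite set `A` below is a made-up stand-in for the set of achievable single-slot vectors.

```python
import random


def f(beta):
    """Sum of squares, the convex function used in the proof."""
    return sum(b * b for b in beta)


def vector_sum(vectors):
    """Entrywise sum of a non-empty list of equal-length vectors."""
    return [sum(v[k] for v in vectors) for k in range(len(vectors[0]))]


rng = random.Random(1)  # fixed seed for repeatability
# Hypothetical finite set A of achievable single-slot vectors in R^3.
A = [[rng.uniform(0.0, 2.0) for _ in range(3)] for _ in range(5)]

T = 7
samples = [rng.choice(A) for _ in range(T)]  # beta(t), ..., beta(t+T-1)
total = vector_sum(samples)                  # sum of the T vectors

value = f(total)
bound = T * T * max(f(a) for a in A)         # T^2 * sup over A
```

Here `value` never exceeds `bound`, because the average of the samples lies in the convex hull of `A` and a convex function attains its maximum over a convex hull at a point of the generating set.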
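Proof: (Lemma 8) Recall that $Y_m(t+1) = \max[Y_m(t) - \alpha_m(t), 0]$. 
Define: 

$$\alpha_m^{pos}(t) \triangleq \max[\alpha_m(t), 0]$$
$$\alpha_m^{neg}(t) \triangleq \max[-\alpha_m(t), 0]$$

so that $\alpha_m^{pos}(t) \ge 0$, $\alpha_m^{neg}(t) \ge 0$, and: 

$$\alpha_m(t) = \alpha_m^{pos}(t) - \alpha_m^{neg}(t)$$
$$|\alpha_m(t)| = \alpha_m^{pos}(t) + \alpha_m^{neg}(t)$$

Similar to (64), it is not difficult to show that: 

$$Y_m(t+T) \le \max\Big[Y_m(t) - \sum_{\tau=0}^{T-1}\alpha_m^{pos}(t+\tau), 0\Big] + \sum_{\tau=0}^{T-1}\alpha_m^{neg}(t+\tau)$$

Therefore, using (65): 

$$\frac{1}{2}Y_m(t+T)^2 \le \frac{1}{2}Y_m(t)^2 - Y_m(t)\sum_{\tau=0}^{T-1}\alpha_m(t+\tau) + \frac{1}{2}\Big[\Big(\sum_{\tau=0}^{T-1}\alpha_m^{pos}(t+\tau)\Big)^2 + \Big(\sum_{\tau=0}^{T-1}\alpha_m^{neg}(t+\tau)\Big)^2\Big]$$

The result of inequality (67) follows from the above together 
with the fact that: 

$$\Big(\sum_{\tau=0}^{T-1}\alpha_m^{pos}(t+\tau)\Big)^2 + \Big(\sum_{\tau=0}^{T-1}\alpha_m^{neg}(t+\tau)\Big)^2 \le \Big(\sum_{\tau=0}^{T-1}|\alpha_m(t+\tau)|\Big)^2$$

□ 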



Appendix B — Proof that (23) Implies Strong 
Stability 

The inequality (23) implies that for every integer $G \ge 1$: 

$$\frac{1}{G}\sum_{g=0}^{G-1}\sum_{n\in\mathcal{N}}\mathbb{E}\{Q_n(t_g)\} \le q \tag{72}$$

for some finite constant $q$. Here we show that this implies 
$\overline{Q}_n < \infty$ (the same reasoning also shows that $\overline{Y}_m < \infty$). 
Note from (1) that the queue $Q_n(t)$ can increase by at most 
a finite constant $r$ on any slot (where $r = R_n^{max}$). Thus, for 
any renewal time $t_g$ (with renewal size $T_g$) and for any 
$\tau \in \{0, 1, \ldots, T_g - 1\}$ we have: 

$$Q_n(t_g + \tau) \le Q_n(t_g) + r\tau$$

Using the above inequality yields: 

$$\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}Q_n(\tau)\Big\} \le \sum_{g=0}^{G-1}\mathbb{E}\{T_gQ_n(t_g)\} + \sum_{g=0}^{G-1}\frac{r}{2}\mathbb{E}\{T_g(T_g-1)\} \tag{73}$$

Now recall that $m_1$ and $m_2$ are finite bounds on the first and 
second moment of any renewal time $T_g$, and these bounds 
hold regardless of past events before this renewal interval and 
hence regardless of the value of $Q_n(t_g)$. We thus have: 

$$\mathbb{E}\{T_gQ_n(t_g)\} = \mathbb{E}\{\mathbb{E}\{T_gQ_n(t_g) \mid Q_n(t_g)\}\} = \mathbb{E}\{Q_n(t_g)\mathbb{E}\{T_g \mid Q_n(t_g)\}\} \le \mathbb{E}\{Q_n(t_g)m_1\} = m_1\mathbb{E}\{Q_n(t_g)\}$$

Using this in (73) yields: 

$$\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}Q_n(\tau)\Big\} \le m_1\sum_{g=0}^{G-1}\mathbb{E}\{Q_n(t_g)\} + \frac{Grm_2}{2}$$

Dividing by $G$ and using (72) yields: 

$$\frac{1}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}Q_n(\tau)\Big\} \le m_1q + \frac{rm_2}{2}$$

Now note that all renewal intervals are at least one slot, so 
that $G \le t_G$. Using this with the fact that queue values are 
non-negative yields: 

$$\frac{1}{G}\sum_{\tau=0}^{G-1}\mathbb{E}\{Q_n(\tau)\} \le \frac{1}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}Q_n(\tau)\Big\} \le m_1q + \frac{rm_2}{2}$$

The above holds for all positive integers $G$. Taking a limit as 
$G \to \infty$ shows that $\overline{Q}_n < \infty$ and proves the result. Strong 
stability of the queues $Q_n(t)$ and $Y_m(t)$, together with the 
fact that these queues have finite maximum transmission rates, 
implies the time average constraints (3)-(4) are satisfied [18]. 



Appendix C — Proof of Theorem 2 

Here we prove Theorem 2. Let $t_g$ be a renewal time (for 
type-2 renewals), and let $T$ be the renewal duration under the 
implemented policy. From ( 1281 and (T% we have: 

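$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le B + C + D^{ssp}(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0^{ssp}(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} + V\delta + \delta\sum_{n\in\mathcal{N}}Q_n(t_g) + \delta\sum_{m\in\mathcal{M}}Y_m(t_g) \tag{74}$$

By definition of the weighted stochastic shortest path policy, 
we have: 

$$D^{ssp}(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0^{ssp}(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le D^*(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T^*-1}x_0^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \tag{75}$$

where $D^*(\Theta(t_g))$ and $x_0^*(t_g+\tau)$ correspond to any other 
control policy $I^*(t)$ that could be implemented over the 
renewal interval (note that the corresponding renewal interval 
size $T$ does not change, so that $T^* = T$, because type-2 
renewal events do not depend on control decisions). Using 
(75) in (74) together with the definition of $D^*(\Theta(t_g))$ given 
in (18) yields: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le B + C + V\delta + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} - \sum_{n\in\mathcal{N}}Q_n(t_g)\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}d_n^*(t_g+\tau) - \delta \,\Big|\, \Theta(t_g)\Big\} - \sum_{m\in\mathcal{M}}Y_m(t_g)\mathbb{E}\Big\{-\delta + \sum_{\tau=0}^{T-1}[x_m^{av} - x_m^*(t_g+\tau)] \,\Big|\, \Theta(t_g)\Big\} \tag{76}$$

where we have used the following notation: 

$$x_m^*(\tau) = \hat{x}_m(I^*(\tau), \Omega(\tau), z^*(\tau))$$
$$d_n^*(\tau) = \hat{d}_n(I^*(\tau), \Omega(\tau), z^*(\tau))$$

Now choose $I^*(t)$ as the policy of Assumption 1 that yields 
(10)-(11). Plugging (10)-(11) directly into (76) and using the 
fact that $T = T^*$ and $\mathbb{E}\{T \mid \Theta\} = 1/\phi$ yields: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le B + C + V\delta + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} - \sum_{n\in\mathcal{N}}Q_n(t_g)(\epsilon/\phi - \delta) - \sum_{m\in\mathcal{M}}Y_m(t_g)(\epsilon/\phi - \delta) \tag{77}$$

Using the bounds $x_0^{min}$ and $x_0^{max}$ in the above inequality and 
rearranging terms yields: 

$$\Delta_T(\Theta(t_g)) + \sum_{n\in\mathcal{N}}Q_n(t_g)(\epsilon/\phi - \delta) + \sum_{m\in\mathcal{M}}Y_m(t_g)(\epsilon/\phi - \delta) \le B + C + V(\delta + (x_0^{max} - x_0^{min})/\phi)$$

Taking expectations, summing over all renewal events $t_g$ (for 
$g \in \{0, 1, 2, \ldots, G-1\}$ and $t_0 = 0$) and using telescoping 
sums, non-negativity of $L(\cdot)$, together with the fact that 
$L(\Theta(0)) = 0$ (as in the proof of Theorem 1) yields: 

$$\frac{1}{G}\sum_{g=0}^{G-1}\Big[\sum_{n\in\mathcal{N}}\mathbb{E}\{Q_n(t_g)\} + \sum_{m\in\mathcal{M}}\mathbb{E}\{Y_m(t_g)\}\Big] \le \frac{B + C + V(\delta + (x_0^{max} - x_0^{min})/\phi)}{\epsilon/\phi - \delta}$$

This proves (29). 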

To prove (30), consider again the drift inequality (76), but 
now plug in the following policy $I^*(t)$: Define probability 
$\theta \triangleq \delta\phi/\epsilon$. This is a valid probability because $\epsilon > \phi\delta$ by 
assumption. At each time $t_g$ that marks the beginning of a 
renewal, independently flip a biased coin with probabilities $\theta$ 
and $1 - \theta$, and carry out one of the two following policies for 
the full duration of the renewal interval: 

• With probability $\theta$: Use the stationary randomized policy 
from Assumption 1 (which we shall call $I_1^*(t)$) for the 
duration of the renewal interval, which yields (10)-(11). 

• With probability $1 - \theta$: Use the stationary randomized 
policy from Assumption 2 (here denoted $I_2^*(t)$) for the 
duration of the renewal interval, which yields (12)-(14). 
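
The per-frame coin flip can be sketched as below. This is a minimal illustration with hypothetical helper names; the numeric values in the example are assumptions chosen only to show that mixing with probability $\theta = \delta\phi/\epsilon$ scales the slack $\epsilon/\phi$ of the Assumption 1 policy down to exactly $\delta$.

```python
import random


def mixing_probability(delta, phi, eps):
    """Return theta = delta*phi/eps, which requires eps > phi*delta >= 0 and phi > 0."""
    if phi <= 0 or not (eps > phi * delta >= 0):
        raise ValueError("need phi > 0 and eps > phi*delta >= 0")
    return delta * phi / eps


def pick_policy(theta, rng):
    """Return 1 (Assumption 1 policy) with probability theta, else 2 (Assumption 2 policy)."""
    return 1 if rng.random() < theta else 2


# Hypothetical parameter values for illustration only.
delta, phi, eps = 0.1, 0.5, 0.2
theta = mixing_probability(delta, phi, eps)

# Expected per-frame slack of the mixture: theta * (eps/phi) + (1 - theta) * 0.
mixed_slack = theta * eps / phi
```

One draw of `pick_policy` is made at each renewal time $t_g$, and the selected policy is then held for the whole renewal interval.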

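With this policy $I^*(t)$, from (12) we have (recall $T^* = T$): 

$$V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le \frac{V}{\phi}[\theta x_0^{max} + (1-\theta)x_0^{opt}] \tag{78}$$

We also have from (10)-(11) and (13)-(14): 

$$\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_m^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le \frac{\theta(x_m^{av} - \epsilon) + (1-\theta)x_m^{av}}{\phi} \tag{79}$$
$$\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}d_n^*(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \ge \frac{\theta\epsilon}{\phi} \tag{80}$$

Plugging (78)-(80) into (76) and using the definition of $\theta \triangleq 
\delta\phi/\epsilon$ yields: 

$$\Delta_T(\Theta(t_g)) + V\mathbb{E}\Big\{\sum_{\tau=0}^{T-1}x_0(t_g+\tau) \,\Big|\, \Theta(t_g)\Big\} \le B + C + V\delta + \frac{V}{\phi}[\theta x_0^{max} + (1-\theta)x_0^{opt}]$$

The above holds for all times $t_g$ that mark the beginning 
of renewal intervals. Defining $T = T_g$ (the duration of the 
$g$th renewal interval) and taking expectations of the above 
inequality yields: 

$$\mathbb{E}\Big\{L(\Theta(t_g + T_g)) - L(\Theta(t_g)) + V\sum_{\tau=0}^{T_g-1}x_0(t_g+\tau)\Big\} \le B + C + V\delta + \frac{V\delta}{\epsilon}(x_0^{max} - x_0^{opt}) + \frac{Vx_0^{opt}}{\phi}$$

Summing over $g \in \{0, \ldots, G-1\}$, dividing by $VG/\phi$, and 
using the fact that $L(\Theta(t_G)) \ge 0$ and $L(\Theta(0)) = 0$ yields for 
any positive integer $G$: 

$$\frac{\mathbb{E}\{\sum_{\tau=0}^{t_G-1}x_0(\tau)\}}{G/\phi} \le x_0^{opt} + \frac{(B + C)\phi}{V} + \delta\phi + \frac{\delta\phi}{\epsilon}(x_0^{max} - x_0^{opt}) \tag{81}$$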

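Because renewal intervals $\{T_g\}_{g=0}^{\infty}$ are i.i.d. geometric random 
variables with $\mathbb{E}\{T_g\} = 1/\phi$, we have by the Law of Large 
Numbers: 

$$\lim_{G\to\infty}\frac{t_G}{G} = \lim_{G\to\infty}\frac{1}{G}\sum_{g=0}^{G-1}T_g = 1/\phi \quad \text{with prob. 1}$$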
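
The Law of Large Numbers statement for geometric renewal durations can be illustrated by simulation. The sketch below assumes a renewal occurs independently with probability $\phi$ on each slot; the value of $\phi$, the number of frames, and the seed are arbitrary choices made for the example.

```python
import random


def geometric_duration(phi, rng):
    """Number of slots until a renewal that occurs independently w.p. phi each slot."""
    T = 1
    while rng.random() >= phi:
        T += 1
    return T


def average_duration(phi, G, seed=0):
    """Return t_G / G, the empirical mean of G i.i.d. renewal durations."""
    rng = random.Random(seed)
    t_G = sum(geometric_duration(phi, rng) for _ in range(G))
    return t_G / G


# Hypothetical setting: phi = 0.25, so the limiting ratio t_G / G is 1/phi = 4.
empirical = average_duration(0.25, 20000)
```

For large $G$ the ratio $t_G/G$ concentrates near $1/\phi$, which is the fact used to convert per-frame averages into per-slot averages.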



Using this together with the fact that $x_0(\tau)$ is upper and lower 
bounded for all $\tau$, it can be shown that (see Appendix D): 

$$\limsup_{G\to\infty}\frac{\mathbb{E}\{\sum_{\tau=0}^{t_G-1}x_0(\tau)\}}{G/\phi} = \limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x_0(\tau)\} \tag{82}$$

Using this fact in (81) proves (30). 



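Appendix D — Proof of (82) 

First consider the case when $x_0^{min} \ge 0$, so that $0 \le x_0(\tau) \le 
x_0^{max}$ for all $\tau$. Let $G(t)$ be the number of renewal events that 
have occurred up to time $t$ (not counting the renewal at time 
0). Then $G(t)/t \to \phi$ with probability 1. Fix a value $\epsilon > 0$ 
such that $0 < \epsilon < \phi$. Define the following event $\chi(t)$: 

$$\chi(t) \triangleq \{(\phi - \epsilon)t \le G(t) \le (\phi + \epsilon)t\}$$

Define $\chi^c(t)$ as the opposite event. Then $Pr[\chi^c(t)] \to 0$ as 
$t \to \infty$. If $\chi(t)$ is true, then $G(t) \le \lceil(\phi + \epsilon)t\rceil$ and so: 

$$t \le t_{\lceil(\phi+\epsilon)t\rceil} \quad \text{whenever } \chi(t) \text{ is true}$$

where we recall that $t_G$ is the time of the $G$th renewal event. 
Now for any time $t$ we have: 

$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x_0(\tau)\} \le \frac{1}{t}\mathbb{E}\Big\{\sum_{\tau=0}^{t_{\lceil(\phi+\epsilon)t\rceil}-1}x_0(\tau) \,\Big|\, \chi(t)\Big\}Pr[\chi(t)] + x_0^{max}Pr[\chi^c(t)]$$
$$\le \frac{1}{t}\mathbb{E}\Big\{\sum_{\tau=0}^{t_{\lceil(\phi+\epsilon)t\rceil}-1}x_0(\tau)\Big\} + x_0^{max}Pr[\chi^c(t)]$$

where the final inequality holds because we have added the 
non-negative term: 

$$\frac{1}{t}\mathbb{E}\Big\{\sum_{\tau=0}^{t_{\lceil(\phi+\epsilon)t\rceil}-1}x_0(\tau) \,\Big|\, \chi^c(t)\Big\}Pr[\chi^c(t)]$$

Therefore: 

$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x_0(\tau)\} \le \frac{\lceil(\phi+\epsilon)t\rceil}{t}\cdot\frac{1}{\lceil(\phi+\epsilon)t\rceil}\mathbb{E}\Big\{\sum_{\tau=0}^{t_{\lceil(\phi+\epsilon)t\rceil}-1}x_0(\tau)\Big\} + x_0^{max}Pr[\chi^c(t)]$$

Taking limits yields: 

$$\limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x_0(\tau)\} \le (\phi + \epsilon)\limsup_{G\to\infty}\frac{1}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}x_0(\tau)\Big\}$$

The above holds for all $\epsilon$ such that $0 < \epsilon < \phi$. Taking a limit 
as $\epsilon \to 0$ yields: 

$$\limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{x_0(\tau)\} \le \phi\limsup_{G\to\infty}\frac{1}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}x_0(\tau)\Big\}$$

The reverse inequality can be proven similarly. This establishes 
(82) for the case when $x_0^{min} \ge 0$. 

For the case $x_0^{min} < 0$, we can define $\tilde{x}_0(t) \triangleq x_0(t) - x_0^{min}$. 
Then we have $0 \le \tilde{x}_0(\tau) \le x_0^{max} - x_0^{min}$ for all $\tau$. It follows 
that: 

$$\limsup_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{\tilde{x}_0(\tau)\} = \phi\limsup_{G\to\infty}\frac{1}{G}\mathbb{E}\Big\{\sum_{\tau=0}^{t_G-1}\tilde{x}_0(\tau)\Big\}$$

Adding $x_0^{min}$ to both sides of the above equality yields the 
result of (82). 

Appendix E — Convergence Proofs 

Here we prove Lemmas 2, 3, and 4 of Section IV-A. We 
have two preliminary lemmas. 

Lemma 9: If $P$ is a transition probability matrix (with rows 
given by probabilities that sum to 1), then for any random 
column vector $X$ with dimension equal to the number of 
columns of $P$, we have: 

$$\|PX\|_e \le \|X\|_e$$

Proof: For every row $i$ of $PX$ we have by Jensen's 
inequality: 

$$\Big(\sum_jP_{ij}X_j\Big)^2 \le \sum_jP_{ij}X_j^2$$

and hence: 

$$\mathbb{E}\Big\{\Big(\sum_jP_{ij}X_j\Big)^2\Big\} \le \sum_jP_{ij}\mathbb{E}\{X_j^2\} \le \max_j\mathbb{E}\{X_j^2\}$$

Thus: 

$$\max_i\mathbb{E}\Big\{\Big(\sum_jP_{ij}X_j\Big)^2\Big\} \le \max_j\mathbb{E}\{X_j^2\}$$

The left hand side of the above inequality is $\|PX\|_e^2$, and the 
right hand side is $\|X\|_e^2$, proving the lemma. □ 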
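
Lemma 9 can be checked on a small example by treating a finite list of equally likely sample vectors as the distribution of $X$. The matrix `P` and the samples below are made-up values for illustration, and `e_norm_sq` is a hypothetical helper computing $\max_i \mathbb{E}\{X_i^2\}$ under that empirical distribution.

```python
def e_norm_sq(samples):
    """Return max_i E{X_i^2}, where X is uniform over the given sample vectors."""
    n = len(samples[0])
    return max(sum(x[i] ** 2 for x in samples) / len(samples) for i in range(n))


def apply(P, x):
    """Matrix-vector product P x for a list-of-rows matrix P."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]


# Hypothetical row-stochastic matrix (each row sums to 1).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.2, 0.7],
     [0.0, 0.0, 1.0]]

# Hypothetical equally likely realizations of the random vector X.
samples = [[1.0, -2.0, 3.0], [0.0, 4.0, -1.0], [2.0, 2.0, 2.0]]

lhs_sq = e_norm_sq([apply(P, x) for x in samples])  # squared e-norm of PX
rhs_sq = e_norm_sq(samples)                         # squared e-norm of X
```

Because the empirical distribution is itself a valid probability distribution, the inequality holds exactly for these samples and not merely approximately.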

We now show that the map $\Psi$ from (35) is a contraction 
with respect to the norm $\|\cdot\|_e$. 

Lemma 10: For any two random vectors $J_1$ and $J_2$, we 
have: 

$$\|\Psi J_1 - \Psi J_2\|_e \le (1-\phi)\|J_1 - J_2\|_e$$

Therefore, letting $J^*$ denote the unique fixed point solution 
to (34) (satisfying $J^* = \Psi J^*$), then for any random vector $J$ 
we have: 

$$\|\Psi J - J^*\|_e \le (1-\phi)\|J - J^*\|_e$$

Proof: From (35) we have: 

$$\Psi J_1 = \phi\mathbb{E}\Big\{\min_{I\in\mathcal{I}_{\omega(t),i}}c^{(1)}(I,\omega(t))\Big\} + (1-\phi)\mathbb{E}\{c^{(0)}(I_1,\omega(t)) + P^{(0)}(I_1,\omega(t))J_1 \mid J_1, J_2\}$$

where $I_1$ represents the policy that minimizes the expectation 
in the final term (corresponding to vector $J_1$), and the 
expectation is with respect to the random $\omega(t)$. The additional 
conditioning on $J_2$ does not change anything and is done to 
facilitate the next few steps. Similarly: 

$$\Psi J_2 = \phi\mathbb{E}\Big\{\min_{I\in\mathcal{I}_{\omega(t),i}}c^{(1)}(I,\omega(t))\Big\} + (1-\phi)\mathbb{E}\{c^{(0)}(I_2,\omega(t)) + P^{(0)}(I_2,\omega(t))J_2 \mid J_1, J_2\}$$

where $I_2$ is the policy that minimizes the expectation in the 
final term (corresponding to vector $J_2$). We thus have: 

$$\Psi J_2 \le \phi\mathbb{E}\Big\{\min_{I\in\mathcal{I}_{\omega(t),i}}c^{(1)}(I,\omega(t))\Big\} + (1-\phi)\mathbb{E}\{c^{(0)}(I_1,\omega(t)) + P^{(0)}(I_1,\omega(t))J_2 \mid J_1, J_2\}$$
$$= \phi\mathbb{E}\Big\{\min_{I\in\mathcal{I}_{\omega(t),i}}c^{(1)}(I,\omega(t))\Big\} + (1-\phi)\mathbb{E}\{c^{(0)}(I_1,\omega(t)) + P^{(0)}(I_1,\omega(t))J_1 \mid J_1, J_2\} + (1-\phi)\mathbb{E}\{P^{(0)}(I_1,\omega(t))\}(J_2 - J_1)$$
$$= \Psi J_1 + (1-\phi)\mathbb{E}\{P^{(0)}(I_1,\omega(t))\}(J_2 - J_1)$$

Therefore we conclude: 

$$\Psi J_2 - \Psi J_1 \le (1-\phi)P_1(J_2 - J_1) \tag{83}$$

where we define $P_1 \triangleq \mathbb{E}\{P^{(0)}(I_1,\omega(t))\}$, and note that each 
row of $P_1$ consists of probabilities that sum to 1 (as it is 
the expectation of matrices with that property). Similarly, by 
swapping the roles of $J_1$ and $J_2$, we have: 

$$\Psi J_1 - \Psi J_2 \le (1-\phi)P_2(J_1 - J_2)$$

where $P_2$ is a transition probability matrix (with rows that sum 
to 1) defined $P_2 \triangleq \mathbb{E}\{P^{(0)}(I_2,\omega(t))\}$. It follows that: 

$$\|\Psi J_2 - \Psi J_1\|_e \le (1-\phi)\max[\|P_1(J_2 - J_1)\|_e, \|P_2(J_1 - J_2)\|_e] \le (1-\phi)\|J_2 - J_1\|_e$$

where the final inequality follows by Lemma 9. □ 

Now consider the iteration: 

$$J_{b+1} = \gamma\tilde{\Psi}J_b + (1-\gamma)J_b \tag{84}$$

where $\gamma$ satisfies $0 < \gamma < 1$, and where: 

$$\tilde{\Psi}J_b = \Psi J_b + \eta_b$$

where $\Psi$ is the map of (35), and $\{\eta_b\}_{b=0}^{\infty}$ is a sequence of 
zero mean vector random variables, where each entry of $\eta_b$ is 
uncorrelated with any deterministic function of $J_b$. We show 
that $\|J_b\|_d$ and $\|\eta_b\|_d$ are deterministically bounded. 

Lemma 2: Define $J_{max} \triangleq c_{max}/\phi$. If $\|J_0\|_d \le J_{max}$, then: 

(a) For all $b \in \{0, 1, 2, \ldots\}$ we have: 

$$\|J_b\|_d \le J_{max}$$

(b) There are finite constants $\eta_{min}$ and $\eta_{max}$ such that 
$\eta_{min} \le \eta_b[i] \le \eta_{max}$ for all iterations $b$ and all entries $i$. 
Further, for all $b$, if batches of size $L \ge 1$ are used, then: 

$$\|\eta_b\|_e^2 \le \frac{|\eta_{min}\eta_{max}|}{L} \le \frac{4(c_{max} + (1-\phi)J_{max})^2}{L}$$

Proof: Suppose that $\|J_b\|_d \le J_{max}$ for some iteration 
$b \ge 0$ (it holds by assumption for $b = 0$). We show that it also 
holds for $b + 1$. By the update equations (35) and (84), it is 
not difficult to show that: 

$$\|\tilde{\Psi}J_b\|_d \le c_{max} + (1-\phi)J_{max}$$

Thus: 

$$\|J_{b+1}\|_d \le \gamma[c_{max} + (1-\phi)J_{max}] + (1-\gamma)J_{max} = \gamma c_{max} + (1-\phi\gamma)J_{max} = \gamma\phi J_{max} + (1-\phi\gamma)J_{max} = J_{max}$$

This proves part (a). Part (b) follows because: 

$$\|\eta_b\|_d = \|\tilde{\Psi}J_b - \Psi J_b\|_d \le 2c_{max} + 2(1-\phi)J_{max}$$

Further, if $\eta_b$ is based only on one sample (so that $L = 1$), 
then its variance is bounded by $|\eta_{min}\eta_{max}|$ (see Lemma 11 
in Appendix F). If it is based on $L$ i.i.d. samples, then the 
variance is reduced by a factor of $L$. □ 
Lemma 3: Let $J_b$ be the $b$th iteration of (39), starting with 
some initial vector $J_0$ with $\|J_0\|_e \le J_{max}$. Assume that for 
all $b$ we have $\|\eta_b\|_e^2 \le \sigma^2$ for some finite constant $\sigma^2$ (as in 
Lemma 2 with $\sigma^2 = |\eta_{min}\eta_{max}|/L$). Let $J^*$ be the optimal 
solution to (34), satisfying $\Psi J^* = J^*$. Then: 

(a) Every one-step iteration satisfies (for integers $b \ge 0$): 

$$\|J_{b+1} - J^*\|_e^2 \le (1-\phi\gamma)^2\|J_b - J^*\|_e^2 + \gamma^2\|\eta_b\|_e^2$$

(b) After $b$ iterations we have: 

$$\|J_b - J^*\|_e^2 \le (1-\phi\gamma)^{2b}\|J_0 - J^*\|_e^2 + \frac{\gamma\sigma^2(1 - (1-\phi\gamma)^{2b})}{\phi(2 - \phi\gamma)}$$

(c) In the limit, we have: 

$$\lim_{b\to\infty}\|J_b - J^*\|_e^2 \le \frac{\gamma\sigma^2}{\phi(2 - \phi\gamma)}$$

Proof: To prove part (a), we have: 

$$\|J_{b+1} - J^*\|_e^2 = \|\gamma\Psi J_b + \gamma\eta_b + (1-\gamma)J_b - J^*\|_e^2 = \max_i\mathbb{E}\{(\gamma\Psi J_b + \gamma\eta_b + (1-\gamma)J_b - J^*)[i]^2\}$$

where the final term represents the $i$th entry of the random 
vector: 

$$\gamma\Psi J_b + \gamma\eta_b + (1-\gamma)J_b - J^*$$

However, each entry $i$ of the $\eta_b$ vector is zero mean and 
uncorrelated with any deterministic function of $J_b$. Thus, 
the second moment of the $i$th entry is equal to the sum of 
the second moment of the $\gamma\eta_b[i]$ component and the second 
moment of the remaining components. Thus: 

$$\|J_{b+1} - J^*\|_e^2 \le \|\gamma\Psi J_b + (1-\gamma)J_b - J^*\|_e^2 + \gamma^2\|\eta_b\|_e^2 \tag{85}$$

However: 

$$\|\gamma\Psi J_b + (1-\gamma)J_b - J^*\|_e^2 = \|\gamma(\Psi J_b - \Psi J^*) + (1-\gamma)(J_b - J^*)\|_e^2 \tag{86}$$
$$\le (\gamma\|\Psi J_b - \Psi J^*\|_e + (1-\gamma)\|J_b - J^*\|_e)^2 \tag{87}$$
$$\le (\gamma(1-\phi)\|J_b - J^*\|_e + (1-\gamma)\|J_b - J^*\|_e)^2 \tag{88}$$
$$= (1-\phi\gamma)^2\|J_b - J^*\|_e^2$$

where (86) follows because $\Psi J^* = J^*$, (87) uses the triangle 
inequality, and (88) uses the contraction property of $\Psi$ from 
Lemma 10. Combining the above with (85) establishes part 
(a). 

To prove part (b), we have from repetitions of the iteration 
in part (a): 

$$\|J_b - J^*\|_e^2 \le (1-\phi\gamma)^{2b}\|J_0 - J^*\|_e^2 + \gamma^2\sum_{i=0}^{b-1}\|\eta_i\|_e^2(1-\phi\gamma)^{2(b-1-i)}$$

Because $\|\eta_i\|_e^2 \le \sigma^2$ for all $i$ we have: 

$$\|J_b - J^*\|_e^2 \le (1-\phi\gamma)^{2b}\|J_0 - J^*\|_e^2 + \gamma^2\sigma^2\frac{1 - (1-\phi\gamma)^{2b}}{1 - (1-\phi\gamma)^2}$$

Simplifying the above expression, using $1 - (1-\phi\gamma)^2 = \phi\gamma(2 - \phi\gamma)$, yields the result of part (b). 
Part (c) is an immediate consequence. □ 
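
The closed form in part (b) can be compared against the one-step recursion of part (a) when that recursion is applied with equality and $\|\eta_b\|_e^2 = \sigma^2$ on every step. The sketch below tracks only the scalar error bound, not the vectors $J_b$ themselves; the parameter values are arbitrary assumptions.

```python
def iterate_bound(e0, phi, gamma, sigma2, b):
    """Apply e <- (1 - phi*gamma)^2 * e + gamma^2 * sigma2 for b steps."""
    e = e0
    for _ in range(b):
        e = (1.0 - phi * gamma) ** 2 * e + gamma ** 2 * sigma2
    return e


def closed_form(e0, phi, gamma, sigma2, b):
    """Right-hand side of part (b) with initial squared error e0."""
    rho = (1.0 - phi * gamma) ** (2 * b)
    return rho * e0 + gamma * sigma2 * (1.0 - rho) / (phi * (2.0 - phi * gamma))


def limit_bound(phi, gamma, sigma2):
    """Right-hand side of part (c)."""
    return gamma * sigma2 / (phi * (2.0 - phi * gamma))


# Hypothetical values: e0 = 10, phi = 0.2, gamma = 0.5, sigma^2 = 1.
after_50 = iterate_bound(10.0, 0.2, 0.5, 1.0, 50)
formula_50 = closed_form(10.0, 0.2, 0.5, 1.0, 50)
after_many = iterate_bound(10.0, 0.2, 0.5, 1.0, 5000)
```

The limiting bound grows with the step size $\gamma$ and with the noise level $\sigma^2$, so a smaller $\gamma$ or larger batches tighten the asymptotic error at the cost of a slower geometric rate $(1-\phi\gamma)^{2b}$.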
We now show that an implementation that chooses $I(t)$ over 
a frame according to (32), using the $J_b$ estimate instead of the 
optimal $J^*$ vector, results in an approximation to the stochastic 
shortest path problem that deviates by an amount that depends 
on $\|J_b - J^*\|_e$. 

Lemma 4: Suppose we choose $I(t)$ according to (32) over 
the course of a frame, using a vector $J$ rather than $J^*$. Let 
$J(J)$ represent the expected sum cost over the frame (given 
$J$). Then: 

$$\|J(J) - J^*\|_e \leq \frac{2(1-\phi)\|J - J^*\|_e}{\phi} \tag{89}$$

Further, defining $\mathbf{1}$ as a vector of all 1 entries with size equal 
to the dimension of $J$, we have: 

$$\mathbb{E}\{J(J)\} \leq J^* + \mathbf{1}\,\frac{2(1-\phi)\|J - J^*\|_e}{\phi}$$

where the expectation above is with respect to the randomness 
of the $J$ vector. 

Proof: Let $I(t)$ represent the control decision on slot $t$ 
made using the $J$ vector, and let $I^*(t)$ represent the decision 
that would be made under the $J^*$ vector. Then: 

$$J(J) = \phi\,\mathbb{E}\left\{\min_{I} c_\Theta^{(1)}(I, \omega(t))\right\} + (1-\phi)\,\mathbb{E}\left\{c_\Theta^{(0)}(I(t), \omega(t)) + P^{(0)}(I(t), \omega(t))\,J(J) \,\middle|\, J\right\}$$

where the expectation is with respect to the random $\omega(t)$ 
outcome. Thus: 

$$\begin{aligned}
J(J) ={}& \phi\,\mathbb{E}\left\{\min_{I} c_\Theta^{(1)}(I, \omega(t))\right\} \\
&+ (1-\phi)\,\mathbb{E}\left\{c_\Theta^{(0)}(I(t), \omega(t)) + P^{(0)}(I(t), \omega(t))\,J \,\middle|\, J\right\} \\
&+ (1-\phi)\,\mathbb{E}\left\{P^{(0)}(I(t), \omega(t))\right\}(J(J) - J)
\end{aligned} \tag{90}$$

Because $I(t)$ minimizes the second term of the above 
equality, we have: 

$$\begin{aligned}
(1-\phi)\,&\mathbb{E}\left\{c_\Theta^{(0)}(I(t), \omega(t)) + P^{(0)}(I(t), \omega(t))\,J \,\middle|\, J\right\} \\
&\leq (1-\phi)\,\mathbb{E}\left\{c_\Theta^{(0)}(I^*(t), \omega(t)) + P^{(0)}(I^*(t), \omega(t))\,J \,\middle|\, J\right\} \\
&= (1-\phi)\,\mathbb{E}\left\{c_\Theta^{(0)}(I^*(t), \omega(t)) + P^{(0)}(I^*(t), \omega(t))\,J^* \,\middle|\, J\right\} \\
&\quad + (1-\phi)\,\mathbb{E}\left\{P^{(0)}(I^*(t), \omega(t))\,(J - J^*) \,\middle|\, J\right\}
\end{aligned}$$

Combining the above with (90) yields: 

$$J(J) \leq J^* + (1-\phi)\,\mathbb{E}\left\{P^{(0)}(I(t), \omega(t))\right\}(J(J) - J) + (1-\phi)\,\mathbb{E}\left\{P^{(0)}(I^*(t), \omega(t))\right\}(J - J^*)$$

However, we also know that $J^* \leq J(J)$. Therefore, using 
the fact that the expectation of a transition matrix is also a 
transition matrix, and that $\|PX\|_e \leq \|X\|_e$: 

$$\begin{aligned}
\|J(J) - J^*\|_e &\leq (1-\phi)\|J - J^*\|_e + (1-\phi)\|J(J) - J\|_e \\
&\leq (1-\phi)\left[\|J - J^*\|_e + \|J(J) - J^*\|_e + \|J^* - J\|_e\right]
\end{aligned}$$

Rearranging terms yields (89). 

To prove the final part of the lemma, note that for each 
entry $i$ we have: 

$$J(J)[i] \leq J^*[i] + |J(J)[i] - J^*[i]|$$

and hence (because $\mathbb{E}\{|X|\} \leq \sqrt{\mathbb{E}\{X^2\}}$ for any random 
variable $X$): 

$$\mathbb{E}\{J(J)[i]\} \leq J^*[i] + \|J(J) - J^*\|_e$$

□ 

Appendix F 

Here we state a simple and well known bound on the 
variance of a bounded random variable. The proof is provided 
for completeness. 

Lemma 11: (Maximum Variance of a Bounded Random 
Variable) Let $X$ be a random variable such that $x_{min} \leq 
X \leq x_{max}$ with probability 1, where $x_{min}$ and $x_{max}$ are 
finite constants. Then: 

(a) The variance, denoted by $Var(X)$, is finite and: 

$$Var(X) \leq \frac{(x_{max} - x_{min})^2}{4} \tag{91}$$

Further, the above bound is tight and is achieved by the 
following extremal distribution: 

$$Pr[X = x_{min}] = Pr[X = x_{max}] = 1/2$$

(b) If $X$ additionally has a known mean $\overline{X}$, then the variance 
satisfies the following tighter constraint: 

$$Var(X) \leq (x_{max} - \overline{X})(\overline{X} - x_{min}) \tag{92}$$

Further, the above bound is tight and is achieved by the 
following extremal distribution: 

$$Pr[X = x_{min}] = \frac{x_{max} - \overline{X}}{x_{max} - x_{min}}, \qquad Pr[X = x_{max}] = \frac{\overline{X} - x_{min}}{x_{max} - x_{min}}$$

Note that in the special case when $\overline{X} = 0$, the above lemma 
implies $Var(X) \leq |x_{max} x_{min}|$. 

Proof: (Lemma 11) We have: 

$$\begin{aligned}
Var(X) &= Var(X - x_{min}) \\
&= \mathbb{E}\{(X - x_{min})^2\} - (\overline{X} - x_{min})^2 \\
&\leq (x_{max} - x_{min})(\overline{X} - x_{min}) - (\overline{X} - x_{min})^2 \\
&= (\overline{X} - x_{min})(x_{max} - \overline{X})
\end{aligned}$$

where the inequality in the above chain of expressions follows 
because $0 \leq (X - x_{min}) \leq (x_{max} - x_{min})$ with probability 
1, and so $(X - x_{min})^2 \leq (x_{max} - x_{min})(X - x_{min})$ with 
probability 1. This proves (92) in part (b). 

To prove (91) in part (a), we have: 

$$\begin{aligned}
Var(X) &\leq (\overline{X} - x_{min})(x_{max} - \overline{X}) \\
&\leq \max_{x_{min} \leq x \leq x_{max}} (x - x_{min})(x_{max} - x)
\end{aligned}$$

where the final inequality follows because $x_{min} \leq \overline{X} \leq x_{max}$. 
By taking a derivative with respect to $x$, it is easy to show that: 

$$\max_{x_{min} \leq x \leq x_{max}} (x - x_{min})(x_{max} - x) = \frac{(x_{max} - x_{min})^2}{4}$$

This proves (91). That the bounds (91) and (92) are achieved 
at the given extremal distributions is easily verified. □ 
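
The variance bounds of Lemma 11 can be checked numerically. The following minimal Python sketch evaluates the bounds (91) and (92) and computes the variance of the two-point extremal distribution on the endpoints. The function names and sample values are hypothetical and serve only as an illustration, not as part of the proof.

```python
def bound_unknown_mean(x_min, x_max):
    """Bound (91): (x_max - x_min)^2 / 4."""
    return (x_max - x_min) ** 2 / 4.0


def bound_known_mean(x_min, x_max, mean):
    """Bound (92): (x_max - mean) * (mean - x_min)."""
    return (x_max - mean) * (mean - x_min)


def two_point_variance(x_min, x_max, mean):
    """Variance of the distribution on {x_min, x_max} that has the given mean."""
    p_max = (mean - x_min) / (x_max - x_min)  # Pr[X = x_max]
    p_min = 1.0 - p_max                       # Pr[X = x_min]
    second_moment = p_min * x_min ** 2 + p_max * x_max ** 2
    return second_moment - mean ** 2


def population_variance(values):
    """Variance of the uniform distribution over the listed values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    x_min, x_max = -1.0, 3.0  # hypothetical sample range
    # The two-point distribution meets bound (92) with equality.
    print(two_point_variance(x_min, x_max, 0.5), bound_known_mean(x_min, x_max, 0.5))
    # With the mean at the midpoint, bound (92) coincides with bound (91).
    print(two_point_variance(x_min, x_max, 1.0), bound_unknown_mean(x_min, x_max))
    # Any other distribution on [x_min, x_max] stays below bound (92).
    values = [-1.0, 0.0, 0.5, 2.0, 3.0]
    mean = sum(values) / len(values)
    print(population_variance(values), bound_known_mean(x_min, x_max, mean))
```

The bound (92) is largest when the mean sits at the midpoint of the interval, which is why (91) follows from (92) by maximizing over the mean.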
Appendix G - Proof of Lemma 5 

Here we prove Lemma 5 of Section IV-D, restated below 
for convenience. 

Lemma 5: For the vectors $\Theta_1$ and $\Theta_2$, and for the $\beta$ value 
defined in (43), we have: 

(a) The difference between $J_{\Theta_1}^*$ and $J_{\Theta_2}^*$ satisfies: 

$$|J_{\Theta_1}^* - J_{\Theta_2}^*| \leq \mathbf{1}\frac{\beta}{\phi}$$

(b) Let $I_1(t)$ denote the policy decisions at time $t$ under the 
policy that makes optimal decisions subject to queue backlogs 
$\Theta_1$, and define $J^{mis}$ as the expected sum cost over a frame 
of a mismatched policy that incurs costs according to backlog 
vector $\Theta_2$ but makes decisions according to $I_1(t)$ (and hence 
has the same frame duration and decisions as the optimal 
policy for $\Theta_1$). Then: 

$$J_{\Theta_2}^* \leq J^{mis} \leq J_{\Theta_1}^* + \mathbf{1}\frac{\beta}{\phi}$$

where $\mathbf{1}$ is a vector of all 1 values with the same dimension 
as $J_{\Theta_1}^*$. 

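Proof: By definition, we have $J_{\Theta_2}^* \leq J^{mis}$ (as $J_{\Theta_2}^*$ is the 
minimum sum cost over any policy when penalties are incurred 
according to $\Theta_2$ queue backlog). Consider any entry $z$, and 
suppose we start in initial state $z(0) = z$. Let $T_1$ denote the 
renewal time under policy $I_1(t)$, and let $z_1(\tau)$ denote the state 
at time $\tau$ under policy $I_1(t)$. Then: 

$$\begin{aligned}
J_{\Theta_2}^*[z] &\leq J^{mis}[z] \\
&= \mathbb{E}\left\{\sum_{\tau=0}^{T_1-1} c_{\Theta_2}(I_1(\tau), \Omega(\tau), z_1(\tau))\right\} \\
&= J_{\Theta_1}^*[z] + \mathbb{E}\left\{\sum_{\tau=0}^{T_1-1} c_{\Theta_2}(I_1(\tau), \Omega(\tau), z_1(\tau))\right\} - \mathbb{E}\left\{\sum_{\tau=0}^{T_1-1} c_{\Theta_1}(I_1(\tau), \Omega(\tau), z_1(\tau))\right\} \\
&\leq J_{\Theta_1}^*[z] + \frac{\beta}{\phi}
\end{aligned}$$

where the final inequality is due to the fact that the mean 
renewal time is at most $1/\phi$. This proves part (b). 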
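To prove part (a), note that part (b) implies: 

$$J_{\Theta_2}^* \leq J_{\Theta_1}^* + \mathbf{1}\frac{\beta}{\phi}$$

However, switching the roles of $\Theta_1$ and $\Theta_2$, we can similarly 
derive $J_{\Theta_1}^* \leq J_{\Theta_2}^* + \mathbf{1}\beta/\phi$. This proves part (a). □ 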